nhslogo CS4132 Data Analytics

Analyzing the Impact of Government Expenditure on Education on Economic and Psychological Indicators by Chandranshu Nanda¶

Important Note: Please keep your report concise and relevant (i.e. show only relevant steps and visualizations used to answer your research questions).

Table of Contents¶

  • Analyzing the Impact of Government Expenditure on Education on Economic and Psychological Indicators by Chandranshu Nanda
    • Table of Contents
    • Motivation and Background
      • Background
        • Global Context
        • Singapore's Context
        • Current State
      • Motivation
        • Post-Pandemic Recovery and Resilience
        • Economic Uncertainty and Global Competition
        • Policy Development and Innovation
        • Conclusion
    • Summary of Research Questions & Results
    • Data Source
    • Methodology
      • Setup
      • Data Acquisition
        • Cleaning MXXXXXX.csv
        • Dataset 1
        • Dataset 2
        • Dataset 3
        • Dataset 4
        • Dataset 5
        • Dataset 6
        • Dataset 7
        • Dataset 8
        • Dataset 9
        • Dataset 10
        • Dataset 11
        • Dataset 12
        • Dataset 13
        • Dataset 14
        • Dataset 15
        • Dataset 16
        • Dataset 17
        • Dataset 18
        • Dataset 19
        • Dataset 20
        • Total Records
      • Data Cleaning & EDA
        • Question 1
          • Overview
          • Cleaning
          • Exploratory Analysis
          • Quantitative Analysis
          • Findings
        • Question 2
          • Overview
          • Cleaning
          • Exploratory Analysis
          • Quantitative Analysis
          • Findings
        • Question 3
          • Overview
          • Cleaning
          • Model Creation and Evaluation
          • Model Application
          • Findings
    • Results Findings & Conclusion
      • Question 1
      • Question 2
      • Question 3
      • Conclusion
    • Recommendations or Further Works
      • Areas for Improvement
      • Further Research
    • References
      • Prompts Used
      • Sharing link of entire chat transcript

Motivation and Background¶

Give an overview of the project, motivation, background and goals. This is basically the writeup from your proposal, with improvements based on feedback given.

Background¶

Global Context¶

Education has long been recognized as a cornerstone for social and economic development. Globally, governments invest heavily in education, viewing it as a means to foster economic growth, reduce poverty, and promote social equity. The link between education and economic development was notably emphasized in the early 20th century with the rise of human capital theory, which posits that investment in education enhances individual productivity and, by extension, national economic performance.

Singapore's Context¶

Singapore provides a unique case study in the global landscape of education. Since gaining independence in 1965, Singapore has consistently prioritized education as a key driver of its economic transformation. The government’s strategic investments in education have played a pivotal role in transforming the nation from a developing country to a global economic powerhouse within a few decades.

The Singaporean government allocates a significant portion of its budget to education, aiming to provide high-quality education that equips its citizens with the skills necessary for the country’s knowledge-based economy. Policies such as the "Teach Less, Learn More" initiative, the emphasis on STEM education, and the creation of autonomous universities reflect Singapore’s commitment to continually evolving its educational system to meet the demands of a globalized world.

The outcomes of these investments are evident: Singapore consistently ranks at the top of international assessments like PISA (Programme for International Student Assessment), and its workforce is highly skilled and competitive.

Current State¶

Today, the global focus on education continues to evolve. Governments worldwide are increasingly recognizing the need to invest not only in traditional education but also in areas like mental health, digital literacy, and lifelong learning. The COVID-19 pandemic has further highlighted the importance of resilient and adaptable education systems.

In Singapore, education remains at the forefront of national policy. The government continues to adapt its education strategies to prepare its citizens for the challenges of the future, such as the digital economy and the need for innovation.

Motivation¶

This research on the impact of government expenditure on education, particularly in relation to economic and psychological indicators, is crucial and timely for several reasons:

Post-Pandemic Recovery and Resilience¶

The COVID-19 pandemic has profoundly disrupted educational systems worldwide, exacerbating existing inequalities and challenging the effectiveness of traditional education models. As countries, including Singapore, navigate post-pandemic recovery, understanding how education expenditure can contribute to both economic resilience and psychological well-being is vital. This research can inform policies that ensure education systems are robust and adaptable, helping societies recover more effectively from global disruptions.

Economic Uncertainty and Global Competition¶

In a rapidly changing global economy, where technological advancements and digitalization are reshaping industries, investing in education is more important than ever. Governments are under pressure to produce a workforce that is not only skilled but also adaptable to future economic shifts. This research can provide insights into how strategic investment in education can drive economic growth and maintain a competitive edge in the global market, particularly for countries like Singapore that rely on a knowledge-based economy.

Policy Development and Innovation¶

For policymakers, understanding the multifaceted impacts of education spending is essential for developing informed and effective policies. This research can provide evidence-based insights that guide the allocation of resources towards educational initiatives that yield the greatest economic and psychological benefits. In Singapore and beyond, such insights are crucial for ensuring that education systems evolve to meet contemporary challenges and opportunities.

Conclusion¶

In a world that is increasingly interconnected and facing unprecedented challenges, this research is necessary to explore how government expenditure on education can be optimized to support not only economic development but also the psychological well-being of populations. By addressing these critical issues now, the research can contribute to shaping education policies that are both forward-looking and responsive to current global trends.

Summary of Research Questions & Results¶

Repeat your research questions in a numbered list. After each research question, clearly state the answer/conclusion you determined. Do not give details or justifications yet — just the answer.
  1. What is the quantitative impact of correlation between government expenditure on education on and GDP growth?

    • A country's GDP has a very strong correlation with its Expenditure on Education
    • The Percentage of GDP a country decides to invest in education has a correlation with its GDP per capita. As a country gets bigger and more developed, this correlation gets stronger.
  2. To what extent does government expenditure on education reduce income inequality, as measured by the Gini coefficient?

    • A country's Educational Expenditure per Capita has a strong negative correlation on its Gini Index
    • All countries with Gini Index greater than the 75th percentile have very low Educational Expenditure per Capita
  3. What predictive trends can be identified regarding the future impact of current education spending in Singapore on the psychological economical well-being of its population?

    • As Educational Expenditure increases, predicted GDP per Capita in 2073 also increases.
      • For every increase of 1% in Educational Expenditure (% GDP), the forecasted GDP per Capita in 2073 increases by $1382
    • As Educational Expenditure increases, predicted Gini Index in 2073 decreases.
      • For every increase of 1% in Educational Expenditure (% GDP), the forecasted Gini Index in 2073 decreases by 2
    • At Singapore's current Educational Expenditure (% GDP) (2.15%), the forecasted GDP per Capita in 2073 is $88507 USD
      • If Singapore were to spend the mean Educational Expenditure (% GDP) (4%), the forecasted GDP per Capita in 2073 is $90299 USD
    • At Singapore's current Educational Expenditure (% GDP) (2.15%), the forecasted Gini Index in 2073 is 32.5
      • If Singapore were to spend the mean Educational Expenditure (% GDP) (4%), the forecasted GDP per Capita in 2073 is 29.0

Data Source¶

Numbered list of dataset (with downloadable links pointing to the exact dataset you will be using).

For each link, briefly describe the dataset that will be used in the project.

The data must be real — neither you nor someone else may make up the data.

  1. https://www.iban.com/country-codes

    • (Global), ISO Codes
  2. https://tablebuilder.singstat.gov.sg/table/TS/M850011#

    • (SG) Government Expenditure On Education, from 1960 - 2022
  3. https://tablebuilder.singstat.gov.sg/table/TS/M850581#

    • (SG) Residents Aged 25 Years & Over By Highest Qualification Attained, Sex And Age Group, from 1990 - 2023
  4. https://tablebuilder.singstat.gov.sg/table/TS/M850671#

    • (SG) Proportion Of Students Achieving At Least A Minimum Proficiency In Reading And Mathematics By Level And Sex, from 1995 - 2022
  5. https://tablebuilder.singstat.gov.sg/table/TS/M850691#

    • (SG) Total Net Enrolment Rate, Gender Parity Index, And Pupil-Teacher Ratio For Primary And Secondary Education, from 2016 - 2022
  6. https://tablebuilder.singstat.gov.sg/table/TS/M183401

    • (SG) Average Resident Unemployment Rate By Sex, Age And Highest Qualification Attained, from 1992 - 2023
  7. https://tablebuilder.singstat.gov.sg/table/TS/M015651

    • (SG) Gross Domestic Product At Current Prices, By Industry (SSIC 2020), from 1975 - 2024
  8. https://tablebuilder.singstat.gov.sg/table/CT/17892

    • (SG), Gini Coefficient Among Resident Employed Households, from 2000 - 2023
  9. https://data.uis.unesco.org/

    • (Global), Demographic and socio-economic: Socio-economic indicators, from 2017 - 2023
  10. https://data.uis.unesco.org/

    • (Global), Other policy relevant indicators : Government expenditure on education in US$, from 2017 - 2023
  11. https://worldhappiness.report/data/

    • (Global), World Happiness Report Appendices & Data, from 2005 - 2023
  12. https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?_gl=1

    • (Global), Government expenditure on education, total (% of GDP)
  13. https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?_gl=1

    • (Global), GDP growth (annual %)
  14. https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

    • (Global), GDP (Current US$)
  15. https://data.worldbank.org/indicator/SP.POP.TOTL?name_desc=false

    • (Global), Total Population
  16. https://data.worldbank.org/indicator/SI.POV.GINI?year=1999

    • (Global), Gini Index
  17. https://data.worldbank.org/indicator/SI.DST.10TH.10

    • (Global), Income share held by highest 10%
  18. https://data.worldbank.org/indicator/SI.DST.05TH.20

    • (Global), Income share held by highest 20%
  19. https://data.worldbank.org/indicator/SI.DST.FRST.20

    • (Global), Income share held by lowest 20%
  20. https://data.worldbank.org/indicator/SI.DST.FRST.10

    • (Global), Income share held by lowest 10%

Methodology¶

You should demonstrate the data science life cycle here (from data acquisition to cleaning to EDA and analysis etc).

Setup¶

Below, we demonstrate the setup needed to run the following sections.

Below, we install all the required libraries.

In [3]:
! pip install pandas
! pip install matplotlib
! pip install numpy
! pip install requests
! pip install beautifulsoup4
! pip install ipython
! pip install plotly
! pip install seaborn
! pip install scipy
! pip install scikit-learn
! pip install nbconvert
Requirement already satisfied: pandas in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (2.2.3)
Requirement already satisfied: numpy>=1.23.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas) (2.1.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas) (2024.2)
Requirement already satisfied: six>=1.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: matplotlib in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (3.9.2)
Requirement already satisfied: contourpy>=1.0.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (4.54.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (1.4.7)
Requirement already satisfied: numpy>=1.23 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (2.1.1)
Requirement already satisfied: packaging>=20.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (24.1)
Requirement already satisfied: pillow>=8 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (3.1.4)
Requirement already satisfied: python-dateutil>=2.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Requirement already satisfied: numpy in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (2.1.1)
Requirement already satisfied: requests in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (2.32.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from requests) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from requests) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from requests) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from requests) (2024.8.30)
Requirement already satisfied: beautifulsoup4 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (4.12.3)
Requirement already satisfied: soupsieve>1.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from beautifulsoup4) (2.6)
Requirement already satisfied: ipython in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (8.27.0)
Requirement already satisfied: decorator in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (5.1.1)
Requirement already satisfied: jedi>=0.16 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (0.19.1)
Requirement already satisfied: matplotlib-inline in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (0.1.7)
Requirement already satisfied: prompt-toolkit<3.1.0,>=3.0.41 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (3.0.48)
Requirement already satisfied: pygments>=2.4.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (2.18.0)
Requirement already satisfied: stack-data in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (0.6.3)
Requirement already satisfied: traitlets>=5.13.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (5.14.3)
Requirement already satisfied: typing-extensions>=4.6 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (4.12.2)
Requirement already satisfied: pexpect>4.3 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from ipython) (4.9.0)
Requirement already satisfied: parso<0.9.0,>=0.8.3 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jedi>=0.16->ipython) (0.8.4)
Requirement already satisfied: ptyprocess>=0.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pexpect>4.3->ipython) (0.7.0)
Requirement already satisfied: wcwidth in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from prompt-toolkit<3.1.0,>=3.0.41->ipython) (0.2.13)
Requirement already satisfied: executing>=1.2.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from stack-data->ipython) (2.1.0)
Requirement already satisfied: asttokens>=2.1.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from stack-data->ipython) (2.4.1)
Requirement already satisfied: pure-eval in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from stack-data->ipython) (0.2.3)
Requirement already satisfied: six>=1.12.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from asttokens>=2.1.0->stack-data->ipython) (1.16.0)
Requirement already satisfied: plotly in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (5.24.1)
Requirement already satisfied: tenacity>=6.2.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from plotly) (9.0.0)
Requirement already satisfied: packaging in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from plotly) (24.1)
Requirement already satisfied: seaborn in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from seaborn) (2.1.1)
Requirement already satisfied: pandas>=1.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from seaborn) (2.2.3)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from seaborn) (3.9.2)
Requirement already satisfied: contourpy>=1.0.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.54.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.7)
Requirement already satisfied: packaging>=20.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.1)
Requirement already satisfied: pillow>=8 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.4)
Requirement already satisfied: python-dateutil>=2.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas>=1.2->seaborn) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from pandas>=1.2->seaborn) (2024.2)
Requirement already satisfied: six>=1.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
Requirement already satisfied: scipy in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (1.14.1)
Requirement already satisfied: numpy<2.3,>=1.23.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from scipy) (2.1.1)
Requirement already satisfied: scikit-learn in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (1.5.2)
Requirement already satisfied: numpy>=1.19.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from scikit-learn) (2.1.1)
Requirement already satisfied: scipy>=1.6.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from scikit-learn) (1.14.1)
Requirement already satisfied: joblib>=1.2.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from scikit-learn) (3.5.0)
Requirement already satisfied: nbconvert in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (7.16.4)
Requirement already satisfied: beautifulsoup4 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (4.12.3)
Requirement already satisfied: bleach!=5.0.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (6.1.0)
Requirement already satisfied: defusedxml in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (0.7.1)
Requirement already satisfied: jinja2>=3.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (3.1.4)
Requirement already satisfied: jupyter-core>=4.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (5.7.2)
Requirement already satisfied: jupyterlab-pygments in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (0.3.0)
Requirement already satisfied: markupsafe>=2.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (2.1.5)
Requirement already satisfied: mistune<4,>=2.0.3 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (3.0.2)
Requirement already satisfied: nbclient>=0.5.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (0.10.0)
Requirement already satisfied: nbformat>=5.7 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (5.10.4)
Requirement already satisfied: packaging in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (24.1)
Requirement already satisfied: pandocfilters>=1.4.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (1.5.1)
Requirement already satisfied: pygments>=2.4.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (2.18.0)
Requirement already satisfied: tinycss2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (1.3.0)
Requirement already satisfied: traitlets>=5.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbconvert) (5.14.3)
Requirement already satisfied: six>=1.9.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from bleach!=5.0.0->nbconvert) (1.16.0)
Requirement already satisfied: webencodings in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from bleach!=5.0.0->nbconvert) (0.5.1)
Requirement already satisfied: platformdirs>=2.5 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-core>=4.7->nbconvert) (4.3.6)
Requirement already satisfied: jupyter-client>=6.1.12 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbclient>=0.5.0->nbconvert) (7.4.9)
Requirement already satisfied: fastjsonschema>=2.15 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbformat>=5.7->nbconvert) (2.20.0)
Requirement already satisfied: jsonschema>=2.6 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from nbformat>=5.7->nbconvert) (4.23.0)
Requirement already satisfied: soupsieve>1.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from beautifulsoup4->nbconvert) (2.6)
Requirement already satisfied: attrs>=22.2.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (24.2.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (2023.12.1)
Requirement already satisfied: referencing>=0.28.4 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (0.35.1)
Requirement already satisfied: rpds-py>=0.7.1 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (0.20.0)
Requirement already satisfied: entrypoints in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (0.4)
Requirement already satisfied: nest-asyncio>=1.5.4 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (1.6.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (2.9.0.post0)
Requirement already satisfied: pyzmq>=23.0 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (24.0.1)
Requirement already satisfied: tornado>=6.2 in /Users/chandranshunanda/Desktop/WebDev/CS4132/.venv/lib/python3.11/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (6.4.1)

Below, we import all the necessary libraries

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import requests
from bs4 import BeautifulSoup
from IPython.display import display
import plotly.express as px
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error

Below, we do the setup for Plotly. Since Plotly uses Javascript, we must change the renderer when we export to HTML. Additionally, the notebook must be connected to wifi for the Plotly to work as intended.

In [89]:
import plotly.io as pio
pio.renderers.default = 'browser' # Change to 'browser' before exporting to html.

Data Acquisition¶

Display the data which will be used in the project.

The data should be saved in .xlsx or .csv format to be submitted with the project. If webscraping has been done to obtain your data, save your webscraping code in another jupyter notebook as appendix to be submitted separately from the report.

Import and display each dataset in a dataframe.

For each dataset, give a brief overview of the data it contains, and explain the meaning of columns that are relevant to the project.

Cleaning MXXXXXX.csv¶

The following function loads and cleans CSV files from https://tablebuilder.singstat.gov.sg/

In [6]:
def load_and_clean_MXXXXXX(filename, skiprows, drop_last_rows):
    # Load the CSV file, skipping the specified number of rows at the beginning
    df = pd.read_csv(f'./data/{filename}.csv', skiprows=skiprows, index_col=0)

    # Transpose the DataFrame and drop the specified number of rows from the end
    df = df.T.iloc[:, :-drop_last_rows]
    return df

Dataset 1¶

This dataset covers ISO Codes. We will obtain the data via web scraping.

In [7]:
url = "https://www.iban.com/country-codes"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
table = soup.find("table", {"class": "table"})
headers = [header.text for header in table.find_all("th")]

rows = []
for row in table.find_all("tr")[1:]:
    cells = row.find_all("td")
    row_data = [cell.text for cell in cells]
    rows.append(row_data)

country_codes_df = pd.DataFrame(rows, columns=headers)
country_codes_dict = country_codes_df.set_index('Alpha-3 code')[['Country']].to_dict()['Country']
country_codes_df
Out[7]:
Country Alpha-2 code Alpha-3 code Numeric
0 Afghanistan AF AFG 004
1 Åland Islands AX ALA 248
2 Albania AL ALB 008
3 Algeria DZ DZA 012
4 American Samoa AS ASM 016
... ... ... ... ...
244 Wallis and Futuna WF WLF 876
245 Western Sahara EH ESH 732
246 Yemen YE YEM 887
247 Zambia ZM ZMB 894
248 Zimbabwe ZW ZWE 716

249 rows × 4 columns

Dataset 2¶

M850011.csv contains all of the information related to government expenditure on education in Singapore over multiple years. This file includes a detailed breakdown of the government's education spending, allowing for analysis of trends and impacts across different education levels and time periods.

In [8]:
df1 = load_and_clean_MXXXXXX('M850011', 9, 31)
df1
Out[8]:
Data Series Total Government Expenditure On Education (Thousand Dollars) Operating Expenditure (Thousand Dollars) Primary Schools (Thousand Dollars) Secondary Schools & Junior Colleges (Thousand Dollars) Institute Of Technical Education (Thousand Dollars) Tertiary (Thousand Dollars) Universities (Thousand Dollars) National Institute Of Education (Thousand Dollars) Polytechnics (Thousand Dollars) Others (Thousand Dollars) Development Expenditure (Thousand Dollars) Proportion Of Government Expenditure On Education To GDP (Per Cent)
2022 13060530 12890289 3143296 3200498 491147 4304619 3047510 125976 1131133 1750729 170241 na
2021 12909908.0 12604777.0 2944535.0 3084162.0 498915.0 4225708.0 2939737.0 123306.0 1162665.0 1851457.0 305131.0 2.2
2020 12259853.0 11766853.0 2674257.0 2880658.0 470521.0 4155545.0 2906300.0 122227.0 1127018.0 1585872.0 493000.0 2.5
2019 12723276.0 11932177.0 2738444.0 3054241.0 473599.0 4307399.0 3008764.0 124176.0 1174459.0 1358494.0 791099.0 2.5
2018 12875992.0 12429006.0 2823567.0 3176690.0 489278.0 4608243.0 3243605.0 105071.0 1259567.0 1331228.0 446986.0 2.5
... ... ... ... ... ... ... ... ... ... ... ... ...
1964 120229 103358 61337 21212 3376 12735 9255 1630 1850 4698 16872 na
1963 107888 94644 58581 17206 2362 12277 8744 1533 2000 4218 13244 na
1962 94791 82307 50835 14751 1217 11248 8335 1318 1595 4256 12484 na
1961 74557 67650 43908 12466 1233 6684 3689 1200 1795 3359 6907 na
1960 61403 57100 37259 10409 765 5816 3144 1440 1233 2850 4303 na

63 rows × 12 columns

Dataset 3¶

M850581.csv contains all of the information related to population distribution by age and gender in Singapore, particularly focusing on the breakdown of educational attainment levels across different age groups. This file provides a comprehensive overview of how different educational attainment levels are distributed across various age groups and genders in Singapore.

In [9]:
df2 = load_and_clean_MXXXXXX('M850581', 10, 21)
df2
Out[9]:
Data Series Total 25 - 29 Years 30 - 34 Years 35 - 39 Years 40 - 44 Years 45 - 49 Years 50 - 54 Years 55 - 59 Years 60 - 64 Years 65 Years & Over ... Total Females - University 25 - 29 Years 30 - 34 Years 35 - 39 Years 40 - 44 Years 45 - 49 Years 50 - 54 Years 55 - 59 Years 60 - 64 Years 65 Years & Over
2023 3100900.0 238000.0 307600.0 298300.0 313600.0 310700.0 317800.0 312500.0 292400.0 710100.0 ... 573500.0 76800.0 103600.0 94600.0 93000.0 73800.0 56400.0 35500.0 19600.0 20300.0
2022 3085300.0 215400.0 305500.0 308600.0 333100.0 343400.0 304100.0 310700.0 295300.0 669200.0 ... 552400.0 68500.0 99100.0 92400.0 94600.0 76900.0 50800.0 32200.0 16500.0 21400.0
2021 2978300.0 254800.0 325600.0 310400.0 309100.0 302000.0 260600.0 286200.0 302500.0 627000.0 ... 534200.0 83500.0 108400.0 93600.0 85100.0 63300.0 40500.0 24800.0 17000.0 17900.0
2020 2977800.0 268900.0 296000.0 297800.0 298100.0 311500.0 294200.0 306300.0 286000.0 619000.0 ... 496600.0 86900.0 94600.0 87600.0 78600.0 60700.0 37600.0 23600.0 13700.0 13300.0
2019 2922700.0 252300.0 273200.0 296900.0 304500.0 323000.0 311900.0 316100.0 266200.0 578600.0 ... 470500.0 77400.0 83800.0 88300.0 78000.0 61200.0 34400.0 23000.0 10700.0 13600.0
2018 2877000.0 251800.0 263800.0 303400.0 313100.0 313200.0 316600.0 299000.0 274800.0 541400.0 ... 440400.0 76000.0 79000.0 86300.0 73500.0 53200.0 31300.0 18100.0 12000.0 10900.0
2017 2846000.0 242300.0 262300.0 305700.0 320000.0 313700.0 307800.0 285000.0 261700.0 547500.0 ... 420500.0 72200.0 81500.0 84600.0 68900.0 51400.0 28700.0 15600.0 8400.0 9200.0
2016 2795200.0 253000.0 268200.0 288900.0 304000.0 301100.0 321700.0 299500.0 261400.0 497300.0 ... 398700.0 76900.0 80200.0 76600.0 62100.0 43800.0 27800.0 14900.0 7600.0 8900.0
2015 2750900.0 232800.0 269900.0 300200.0 315300.0 301500.0 319400.0 296700.0 248200.0 467000.0 ... 374800.0 67200.0 76800.0 77700.0 60800.0 39100.0 23600.0 13900.0 7900.0 8000.0
2014 2732600.0 231700.0 276900.0 287900.0 305100.0 298600.0 318300.0 296400.0 246200.0 471500.0 ... 362400.0 67100.0 78200.0 70500.0 56900.0 38300.0 23900.0 12900.0 7300.0 7200.0
2013 2669100.0 223700.0 274600.0 297500.0 310300.0 315100.0 318900.0 280200.0 229400.0 419500.0 ... 350000.0 63900.0 79100.0 72200.0 54400.0 33800.0 22800.0 12700.0 6200.0 5000.0
2012 2626400.0 223200.0 271700.0 302300.0 310800.0 319200.0 313200.0 272100.0 219700.0 394100.0 ... 320700.0 60200.0 75400.0 68700.0 50500.0 28600.0 17100.0 11000.0 4800.0 4500.0
2011 2601900.0 230900.0 280100.0 306900.0 302300.0 321700.0 309600.0 267400.0 209400.0 373700.0 ... 307300.0 61200.0 74900.0 65200.0 45100.0 27700.0 15000.0 8700.0 4800.0 4700.0
2010 2576000.0 249400.0 289300.0 317300.0 303500.0 319600.0 303600.0 252800.0 196400.0 344100.0 ... 288100.0 64700.0 72100.0 60400.0 38000.0 24400.0 13300.0 7500.0 4200.0 3500.0
2009 2523000.0 237800.0 280200.0 312900.0 311900.0 319000.0 298400.0 233300.0 182100.0 347400.0 ... 266400.0 60400.0 69500.0 57100.0 33300.0 21200.0 12000.0 6400.0 3300.0 3300.0
2008 2441300.0 228000.0 272900.0 307900.0 314400.0 316000.0 290000.0 228100.0 160300.0 323700.0 ... 246300.0 55500.0 64700.0 51700.0 32200.0 21000.0 10200.0 5800.0 2400.0 2800.0
2007 2448400.0 231800.0 288000.0 304300.0 317600.0 324300.0 284900.0 229700.0 144000.0 323600.0 ... 226300.0 52500.0 62500.0 45400.0 28100.0 16600.0 10800.0 5100.0 2700.0 2600.0
2006 2408600.0 231200.0 282000.0 298600.0 321800.0 322600.0 276000.0 224000.0 130800.0 321500.0 ... 212700.0 52800.0 56500.0 41200.0 27300.0 16400.0 8800.0 4800.0 2300.0 2500.0
2005 2308500.0 235700.0 288000.0 302800.0 319400.0 300300.0 256800.0 201200.0 120000.0 284300.0 ... 180500.0 48800.0 49400.0 34400.0 22400.0 11900.0 7000.0 3400.0 1600.0 1700.0
2004 2282400.0 230000.0 281700.0 292400.0 322400.0 303000.0 250500.0 190300.0 127900.0 284100.0 ... 172800.0 47000.0 44300.0 32200.0 22700.0 13300.0 6000.0 3800.0 1600.0 1900.0
2003 2263000.0 232400.0 287200.0 308500.0 314800.0 293800.0 243300.0 170500.0 132400.0 280100.0 ... 158600.0 46000.0 42000.0 28800.0 18000.0 11900.0 6200.0 2900.0 1400.0 1400.0
2002 2198900.0 233400.0 281000.0 320100.0 313900.0 281500.0 235100.0 150100.0 127700.0 256100.0 ... 140000.0 42500.0 39100.0 27200.0 13400.0 9000.0 4700.0 2000.0 1200.0 800.0
2001 2126900.0 242800.0 278000.0 318800.0 311700.0 267900.0 223400.0 121000.0 121600.0 241900.0 ... 121200.0 37000.0 32200.0 23000.0 13300.0 7800.0 3900.0 1900.0 1300.0 800.0
2000 2074000.0 250500.0 279500.0 310700.0 304500.0 256800.0 204900.0 125000.0 110700.0 231300.0 ... 105100.0 35000.0 27100.0 18800.0 11200.0 6400.0 3400.0 1600.0 900.0 800.0
1999 2067300.0 251500.0 293400.0 309900.0 300300.0 251800.0 186100.0 131900.0 113400.0 229000.0 ... 94100.0 31700.0 24800.0 15900.0 10200.0 5700.0 2900.0 1500.0 800.0 700.0
1998 2020800.0 242200.0 290000.0 307700.0 299600.0 252200.0 171400.0 132100.0 102000.0 223700.0 ... 87500.0 28400.0 21800.0 15400.0 9900.0 6000.0 2900.0 1200.0 900.0 900.0
1997 1969700.0 248300.0 293000.0 303400.0 286200.0 244500.0 154900.0 133000.0 99100.0 207300.0 ... 75500.0 24700.0 19800.0 12700.0 8600.0 5300.0 2200.0 1400.0 400.0 500.0
1996 1916400.0 240700.0 298200.0 305800.0 278700.0 233900.0 130900.0 127800.0 95600.0 204600.0 ... 69100.0 23500.0 18800.0 11000.0 7200.0 4200.0 2000.0 1300.0 400.0 800.0
1995 1860900.0 239900.0 292400.0 301200.0 264400.0 211300.0 132800.0 121700.0 97200.0 200000.0 ... 54300.0 18400.0 14700.0 9400.0 5400.0 3200.0 1700.0 800.0 300.0 500.0
1994 1816800.0 256100.0 291000.0 291500.0 248300.0 190400.0 142200.0 118900.0 96000.0 182500.0 ... 49600.0 17300.0 12800.0 7900.0 5300.0 2700.0 1600.0 1000.0 500.0 500.0
1993 1768400.0 255300.0 283400.0 278300.0 248900.0 175700.0 141900.0 113700.0 95100.0 176100.0 ... 44500.0 15800.0 11500.0 6600.0 4900.0 2900.0 1600.0 600.0 300.0 300.0
1992 1730600.0 271400.0 292200.0 266700.0 232100.0 155100.0 145000.0 108300.0 93200.0 166600.0 ... 37100.0 13400.0 10100.0 6200.0 3200.0 1900.0 1000.0 800.0 100.0 300.0
1991 1679300.0 272100.0 288300.0 259900.0 226500.0 137700.0 136300.0 107200.0 89600.0 161700.0 ... 31400.0 12600.0 7400.0 4700.0 3400.0 1600.0 900.0 500.0 300.0 100.0
1990 1622300.0 281500.0 292700.0 252600.0 203900.0 127500.0 117500.0 99800.0 82800.0 164100.0 ... 28900.0 10900.0 7400.0 4700.0 3000.0 1400.0 800.0 300.0 200.0 200.0

34 rows × 180 columns

Dataset 4¶

M850671.csv contains all of the information related to the proficiency levels of students in primary and secondary education in Singapore, specifically focusing on their achievements in reading and mathematics. This file provides a detailed view of the educational outcomes in terms of reading and mathematics proficiency among students in Singapore's primary and secondary education systems.

In [10]:
df3=load_and_clean_MXXXXXX('M850671', 10, 24)
df3
Out[10]:
Data Series Proportion Of Students In Primary Education Achieving At Least A Minimum Proficiency Level In Reading Male Female Proportion Of Students In Secondary Education Achieving At Least A Minimum Proficiency Level In Reading Male Female Proportion Of Students In Primary Education Achieving At Least A Minimum Proficiency Level In Mathematics Male Female Proportion Of Students In Secondary Education Achieving At Least A Minimum Proficiency Level In Mathematics Male Female
2022 na na na 88.83 86.32 91.44 na na na 91.98 91.53 92.44
2021 96.66 95.46 97.93 na na na na na na na na na
2019 na na na na na na 95.51 95.07 95.98 91.82 90.47 93.23
2018 na na na 88.75 85.77 91.86 na na na na na na
2016 97.25 96.4 98.16 na na na na na na na na na
2015 na na na 88.85 86.34 91.53 93.39 92.73 94.1 93.55 92.27 94.96
2012 na na na 90.13 86.58 93.84 na na na na na na
2011 96.79 95.75 97.9 na na na 93.87 92.94 94.85 92.48 90.41 94.61
2009 na na na 87.52 83.83 91.34 na na na na na na
2007 na na na na na na 91.75 90.38 93.22 88.17 85.79 90.79
2006 96.65 95.54 97.98 na na na na na na na na na
2003 na na na na na na 91.2 89.81 92.65 92.85 91.53 94.29
2001 90.21 87.55 93.07 na na na na na na na na na
1999 na na na na na na na na na 93.87 93.11 94.68
1995 na na na na na na 84.98 83.41 86.74 95.34 95.37 95.39

Dataset 5¶

M850691.csv contains all of the information related to net enrolment rates in primary and secondary education in Singapore, with a focus on gender parity and overall enrolment metrics. This file provides insights into the enrolment rates and gender parity in education within Singapore, allowing for analysis of trends in educational access across different demographics.

In [11]:
df4=load_and_clean_MXXXXXX('M850691', 9, 29)
df4
Out[11]:
Data Series Total Net Enrolment Rate - Primary Education (Total) (Per Cent) Total Net Enrolment Rate - Primary Education (Male) (Per Cent) Total Net Enrolment Rate - Primary Education (Female) (Per Cent) Total Net Enrolment Rate - Secondary Education (Total) (Per Cent) Total Net Enrolment Rate - Secondary Education (Male) (Per Cent) Total Net Enrolment Rate - Secondary Education (Female) (Per Cent) Total Net Enrolment Rate - Lower Secondary Education (Total) (Per Cent) Total Net Enrolment Rate - Lower Secondary Education (Male) (Per Cent) Total Net Enrolment Rate - Lower Secondary Education (Female) (Per Cent) Total Net Enrolment Rate - Upper Secondary Education (Total) (Per Cent) Total Net Enrolment Rate - Upper Secondary Education (Male) (Per Cent) Total Net Enrolment Rate - Upper Secondary Education (Female) (Per Cent) Gender Parity Index For Total Net Enrolment Rate, Primary Education (Ratio Expressed) Gender Parity Index For Total Net Enrolment Rate, Secondary Education (Ratio Expressed) Gender Parity Index For Total Net Enrolment Rate, Lower Secondary Education (Ratio Expressed) Gender Parity Index For Total Net Enrolment Rate, Upper Secondary Education (Ratio Expressed) Pupil-Teacher Ratio In Primary School (Per Cent) Pupil-Teacher Ratio In Secondary School (Per Cent)
2022 99.1 98.9 99.2 99.6 99.7 99.5 100.0 99.9 100.0 98.3 98.5 98.0 1.0 1.0 1.00 0.99 14.0 11.9
2021 99.4 99.4 99.3 99.1 99.0 99.2 99.5 99.4 99.7 98.6 98.5 98.7 1.0 1.0 1.00 1.00 13.8 11.6
2020 99.9 99.8 100.0 99.0 99.2 98.8 99.1 99.2 99.0 98.9 99.2 98.6 1.0 1.0 1.00 0.99 13.9 11.5
2019 99.9 99.8 100.0 99.7 99.7 99.7 99.7 99.6 99.8 99.7 99.9 99.6 1.0 1.0 1.00 1.00 14.2 11.3
2018 99.8 99.7 99.9 99.8 100.0 99.7 99.4 99.8 99.1 100.0 99.9 100.0 1.0 1.0 0.99 1.00 14.3 11.3
2017 100.0 99.9 100.0 100.0 100.0 100.0 99.9 100.0 99.9 99.9 99.8 100.0 1.0 1.0 1.00 1.00 14.7 11.5
2016 100.0 100.0 100.0 100.0 99.9 100.0 99.9 100.0 99.9 99.9 99.8 100.0 1.0 1.0 1.00 1.00 15.1 11.7

Dataset 6¶

M183401.csv contains all of the information related to educational attainment levels and their corresponding proportions within different age groups and genders in Singapore. This file provides a detailed breakdown of how different educational qualifications are distributed across various age groups and genders in Singapore, enabling analysis of trends in educational attainment within the population.

In [12]:
df5=load_and_clean_MXXXXXX('M183401', 10, 21)
df5
Out[12]:
Data Series Total Sex: Male Sex: Female Age Group (Years): Below 30 Age Group (Years): 25 - 29 Age Group (Years): 30 - 39 Age Group (Years): 40 - 49 Age Group (Years): 50 & Over Age Group (Years): 50 - 59 Age Group (Years): 60 & Over Age Group (Years): 25 & Over Age Group (Years): 15 - 24 Sex: Male Sex: Female Highest Qualification Attained: Below Secondary Highest Qualification Attained: Secondary Highest Qualification Attained: Post-Secondary (Non-Tertiary) Highest Qualification Attained: Diploma & Professional Qualification Highest Qualification Attained: Degree
2023 2.7 2.7 2.7 5.2 4.3 2.1 2.3 2.4 2.5 2.2 2.5 6.7 5.4 8.2 2.4 2.9 3.2 2.8 2.6
2022 2.9 2.8 3.0 4.7 4.0 2.3 2.5 2.8 2.7 2.9 2.7 5.9 4.0 8.2 2.5 3.1 3.4 3.0 2.8
2021 3.5 3.3 3.8 5.7 4.7 2.6 3.2 3.4 3.5 3.4 3.3 7.3 5.0 10.0 3.1 4.1 4.5 3.9 3.2
2020 4.1 3.9 4.4 7.5 5.5 3.0 3.4 3.9 3.9 3.7 3.7 10.6 8.1 13.5 4.1 4.6 4.9 4.7 3.5
2019 3.1 3.0 3.4 5.9 4.6 2.2 2.5 2.9 3.1 2.7 2.8 7.7 5.8 10.0 2.9 3.5 3.6 3.3 2.9
2018 2.9 2.9 3.0 5.2 4.1 2.2 2.3 2.6 2.7 2.5 2.6 6.6 5.0 8.6 2.5 2.8 3.4 3.1 2.9
2017 3.1 3.1 3.1 5.4 4.3 2.4 2.5 2.8 3.0 2.5 2.8 7.0 5.2 9.1 2.6 2.8 3.8 3.3 3.2
2016 3.0 3.0 3.0 5.0 3.8 2.3 2.4 2.7 2.7 2.6 2.7 6.5 4.6 8.8 2.4 3.1 3.2 3.1 3.1
2015 2.8 2.7 2.9 5.1 3.6 1.9 2.4 2.4 2.4 2.3 2.4 6.7 5.6 8.1 2.5 2.7 3.0 2.9 2.8
2014 2.7 2.7 2.9 5.2 4.1 2.2 2.1 2.3 2.2 2.4 2.4 6.4 4.9 8.1 2.5 2.8 2.7 2.9 2.9
2013 2.8 2.7 2.9 5.2 3.9 2.2 2.2 2.3 2.3 2.2 2.4 6.7 5.2 8.6 2.4 2.9 3.5 2.7 2.8
2012 2.8 2.7 3.1 5.1 3.5 2.2 2.1 2.4 2.2 2.8 2.4 6.7 5.3 8.2 2.6 2.9 3.5 2.8 2.7
2011 2.9 2.6 3.2 5.0 3.4 2.4 2.1 2.5 2.5 2.5 2.5 6.7 4.8 9.0 2.8 3.5 3.2 2.7 2.6
2010 3.1 3.0 3.4 5.5 3.9 2.3 2.6 2.7 2.6 2.9 2.7 7.4 5.5 9.6 3.4 3.3 3.8 2.9 2.8
2009 4.3 4.1 4.7 6.7 5.0 3.6 3.8 3.9 4.0 3.7 3.9 8.8 6.6 11.5 4.9 4.7 4.8 3.9 3.6
2008 3.2 3.0 3.5 5.2 3.7 2.6 2.7 2.9 3.1 2.3 2.8 7.0 5.4 8.9 3.9 3.3 3.7 2.8 2.7
2007 3.0 2.8 3.1 4.5 3.0 2.4 2.4 2.9 2.9 2.6 2.6 6.3 4.6 8.2 3.2 3.4 2.7 2.9 2.4
2006 3.6 3.4 3.7 5.4 3.8 3.0 3.0 3.4 3.5 2.8 3.2 7.1 6.1 8.3 4.2 3.9 3.5 3.2 2.8
2005 4.1 3.9 4.5 6.2 4.4 3.1 3.6 4.1 4.2 3.8 3.7 8.3 6.2 10.8 5.2 4.6 4.1 3.2 3.1
2004 4.4 4.5 4.4 6.7 5.1 3.5 3.8 4.3 4.6 3.0 4.0 8.6 6.0 11.6 5.2 4.6 3.8 4.2 3.6
2003 5.2 5.1 5.3 7.3 5.8 4.5 4.9 4.5 5.0 2.8 4.8 9.3 6.7 12.3 6.3 5.3 4.2 4.7 4.4
2002 4.8 4.8 4.8 6.6 5.0 3.9 4.6 4.4 4.8 3.3 4.4 8.8 5.5 12.3 5.8 4.8 3.8 4.1 4.1
2001 3.7 3.9 3.4 5.1 4.3 3.1 3.4 3.5 3.8 2.3 3.4 6.2 4.6 7.8 4.7 3.5 2.9 3.4 2.9
2000 3.7 3.9 3.5 4.6 3.6 3.1 3.7 3.4 3.6 2.7 3.4 5.9 4.7 7.1 5.2 3.6 2.4 2.4 2.6
1999 3.8 3.8 3.8 5.2 4 3.2 3.4 3.4 3.7 2.7 3.4 6.8 na na 4.8 3.6 2.9 2.6 3.1
1998 3.5 3.5 3.4 5.1 3.8 2.8 3.2 2.7 3.3 1 3.1 6.7 na na 4.3 3.4 1.9 2.9 2.8
1997 2 2 1.9 3.2 2.3 1.5 1.6 1.5 1.5 1.4 1.7 4.3 na na 2.4 1.7 1.6 1.8 1.7
1996 2.2 2.3 2 3.4 2.7 1.7 1.6 1.9 2.1 1.3 1.9 4.2 na na 2.5 2 1.5 1.7 2.6
1995 2.2 2.3 2.1 3.6 2.5 1.7 1.6 1.8 2 1.2 1.8 4.8 na na 2.7 2 1.5 1.9 2.1
1994 2.2 2.2 2.1 3.6 2.4 1.6 1.5 1.4 1.6 1 1.7 4.8 na na 2.5 1.8 1.4 2 2.6
1993 2.1 2.1 2 3.3 2.3 1.6 1.4 1.5 1.6 1.2 1.7 4.3 na na 2.4 2 1.2 1.7 2.3
1992 2.2 2 2.5 3.3 2 1.7 1.7 1.3 1.4 1.2 1.7 4.6 na na 2.5 2 1.6 1.5 2.2

Dataset 7¶

M015651.csv contains all of the information related to the Gross Domestic Product (GDP) of Singapore, broken down by industry sector, including goods-producing industries and services-producing industries. This file provides a detailed breakdown of GDP contributions by various sectors of the Singapore economy, enabling analysis of economic performance across different industries.

In [13]:
df6=load_and_clean_MXXXXXX('M015651', 10, 31)
df6
Out[13]:
Data Series GDP At Current Market Prices Goods Producing Industries Manufacturing Construction Utilities Other Goods Industries Services Producing Industries Wholesale & Retail Trade Wholesale Trade Retail Trade ... Administrative & Support Services Other Services Industries Public Administration & Defence Education Health & Social Services Arts, Entertainment & Recreation Other Services - Others Ownership Of Dwellings Gross Value Added At Basic Prices Add: Taxes On Products
2024 2Q 180428.1 39146.9 31050.7 5708.6 2340.9 46.7 125263.6 40108.1 38084.5 2023.6 ... 4532.7 14774.4 3283.2 3724.3 4456.0 1521.3 1789.6 6581.6 170992.1 9436.0
2024 1Q 172917.4 38286.4 30079.4 5865.9 2292.4 48.7 119386.3 34139.4 32036.1 2103.3 ... 4753.1 17496.0 4993.1 4597.6 4174.2 1904.7 1826.4 6360.5 164033.2 8884.2
2023 4Q 176412.1 36945.0 28355.3 6112.7 2423.6 53.4 124243.1 44710.6 42513.2 2197.4 ... 4639.9 16168.3 4330.8 4202.1 4134.7 1723.6 1777.1 6079.8 167267.9 9144.2
2023 3Q 168373.8 37946.2 29832.9 5600.7 2460.8 51.8 115633.9 38452.0 36420.8 2031.2 ... 4427.5 14778.4 3463.6 3869.1 4138.6 1641.6 1665.5 5898.9 159479.0 8894.8
2023 2Q 167351.6 38067.5 30281.6 5311.4 2429.2 45.3 115418.3 38151.6 36136.2 2015.4 ... 4382.1 13839.0 3157.0 3576.8 4065.9 1368.3 1671.0 5729.5 159215.3 8136.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1976 1Q 3545.1 1166.6 744.8 271.7 73.4 76.7 2154.2 677.5 na na ... na 357.1 na na na na na 88.6 3409.4 135.7
1975 4Q 3693 1269.6 810.1 294.9 83.2 81.4 2206.6 724.8 na na ... na 402.4 na na na na na 86.7 3562.9 130.1
1975 3Q 3519.3 1205.7 785.1 272.6 73.5 74.5 2105.2 716 na na ... na 344.7 na na na na na 86 3396.9 122.4
1975 2Q 3307 1090.9 692.9 256.3 71.2 70.5 2019.6 704.9 na na ... na 324.1 na na na na na 83.5 3194 113
1975 1Q 3209.4 1014.2 654.9 228.5 61.3 69.5 1991.7 672.8 na na ... na 328.1 na na na na na 81.2 3087.1 122.3

198 rows × 29 columns

Dataset 8¶

17892.csv contains all of the information related to the Gini coefficient of Singapore, specifically focusing on household income inequality before and after accounting for government transfers and taxes. This file provides insights into how government policies such as taxes impact income inequality in Singapore, as reflected by changes in the Gini coefficient.

In [14]:
df7=load_and_clean_MXXXXXX('17892', 10, 27)
df7
Out[14]:
Before / After Accounting for Government Transfers and Taxes Household Income from Work Per Household Member (Including Employer CPF Contributions) Household Income from Work Per Household Member (Including Employer CPF Contributions) After Accounting for Government Transfers and Taxes1/
2023 0.433 0.371
2022 0.437 0.378
2021 0.444 0.385
2020 0.452 0.375
2019 0.452 0.398
2018 0.458 0.403
2017 0.459 0.402
2016 0.458 0.401
2015 0.463 0.409
2014 0.464 0.411
2013 0.463 0.409
2012 0.478 0.432
2011 0.473 0.423
2010 0.472 0.425
2009 0.471 0.422
2008 0.474 0.424
2007 0.482 0.439
2006 0.470 0.418
2005 0.465 0.422
2004 0.460 0.419
2003 0.457 0.422
2002 0.454 0.414
2001 0.454 0.419
2000 0.442 0.414
Unnamed: 25 NaN NaN

Dataset 9¶

The below file contains all of the information related to various socioeconomic indicators, across regions and time.

In [15]:
df8=pd.read_csv('./data/DEMO_DS_23082024021945252.csv')
df8
Out[15]:
DEMO_IND Indicator LOCATION Country TIME Time Value Flag Codes Flags
0 NY_GDP_MKTP_CN GDP (current LCU) AUS Australia 2017 2017 1.758828e+12 NaN NaN
1 NY_GDP_MKTP_CN GDP (current LCU) AUS Australia 2018 2018 1.842635e+12 NaN NaN
2 NY_GDP_MKTP_CN GDP (current LCU) AUS Australia 2019 2019 1.946613e+12 NaN NaN
3 NY_GDP_MKTP_CN GDP (current LCU) AUS Australia 2020 2020 1.979471e+12 NaN NaN
4 NY_GDP_MKTP_CN GDP (current LCU) AUS Australia 2021 2021 2.080419e+12 NaN NaN
... ... ... ... ... ... ... ... ... ...
27109 NY_GDP_DEFL_ZS GDP deflator (base year varies by country) FRO Faeroe Islands 2017 2017 1.142000e+02 NaN NaN
27110 NY_GDP_DEFL_ZS GDP deflator (base year varies by country) FRO Faeroe Islands 2018 2018 1.138000e+02 NaN NaN
27111 NY_GDP_DEFL_ZS GDP deflator (base year varies by country) FRO Faeroe Islands 2019 2019 1.176000e+02 NaN NaN
27112 NY_GDP_DEFL_ZS GDP deflator (base year varies by country) FRO Faeroe Islands 2020 2020 1.175000e+02 NaN NaN
27113 NY_GDP_DEFL_ZS GDP deflator (base year varies by country) FRO Faeroe Islands 2021 2021 1.201000e+02 NaN NaN

27114 rows × 9 columns

Dataset 10¶

The below file contains all of the information related to Government expenditure on education, across various regions and time.

In [16]:
df9=pd.read_csv('./data/NATMON_DS_23082024022406407.csv')
df9
Out[16]:
NATMON_IND Indicator LOCATION Country TIME Time Value Flag Codes Flags
0 X_US_1_FSGOV Government expenditure on primary education, U... UKR Ukraine 2017 2017 1316.29505 NaN NaN
1 X_US_1_FSGOV Government expenditure on primary education, U... UKR Ukraine 2018 2018 1595.33510 NaN NaN
2 X_US_1_FSGOV Government expenditure on primary education, U... UKR Ukraine 2019 2019 1857.26190 NaN NaN
3 X_US_1_FSGOV Government expenditure on primary education, U... UKR Ukraine 2020 2020 1840.77686 NaN NaN
4 X_US_5T8_FSGOV Government expenditure on tertiary education, ... MLI Mali 2017 2017 100.66726 NaN NaN
... ... ... ... ... ... ... ... ... ...
3776 X_US_2T3_FSGOV Government expenditure on secondary education,... UZB Uzbekistan 2020 2020 1116.55417 NaN NaN
3777 X_US_2T3_FSGOV Government expenditure on secondary education,... UZB Uzbekistan 2021 2021 1050.16985 NaN NaN
3778 X_US_2T4_V_FSGOV Government expenditure on secondary and post-s... USA United States of America 2020 2020 5044.48273 NaN NaN
3779 X_US_2T4_V_FSGOV Government expenditure on secondary and post-s... ARG Argentina 2020 2020 NaN a Category not applicable
3780 X_US_2T4_V_FSGOV Government expenditure on secondary and post-s... ARG Argentina 2021 2021 NaN a Category not applicable

3781 rows × 9 columns

Dataset 11¶

The below file contains all of the information related to 'happiness' (which we will take as an indicator of mental health), and time

In [17]:
df10=pd.read_csv('./data/DataForTable2 (1).csv', encoding='unicode_escape')
df10
Out[17]:
Country name year Life Ladder Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
0 Afghanistan 2008 3.724 7.350 0.451 50.500 0.718 0.164 0.882 0.414 0.258
1 Afghanistan 2009 4.402 7.509 0.552 50.800 0.679 0.187 0.850 0.481 0.237
2 Afghanistan 2010 4.758 7.614 0.539 51.100 0.600 0.118 0.707 0.517 0.275
3 Afghanistan 2011 3.832 7.581 0.521 51.400 0.496 0.160 0.731 0.480 0.267
4 Afghanistan 2012 3.783 7.661 0.521 51.700 0.531 0.234 0.776 0.614 0.268
... ... ... ... ... ... ... ... ... ... ... ...
2358 Zimbabwe 2019 2.694 7.698 0.759 53.100 0.632 -0.051 0.831 0.658 0.235
2359 Zimbabwe 2020 3.160 7.596 0.717 53.575 0.643 0.003 0.789 0.661 0.346
2360 Zimbabwe 2021 3.155 7.657 0.685 54.050 0.668 -0.079 0.757 0.610 0.242
2361 Zimbabwe 2022 3.296 7.670 0.666 54.525 0.652 -0.073 0.753 0.641 0.191
2362 Zimbabwe 2023 3.572 7.679 0.694 55.000 0.735 -0.069 0.757 0.610 0.179

2363 rows × 11 columns

Dataset 12¶

The below file contains all of the information related to governmental expenditure on education as a percentage of gdp

In [18]:
df11=pd.read_csv('./data/API_SE.XPD.TOTL.GD.ZS_DS2_en_csv_v2_68.csv', skiprows=4)
df11
Out[18]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 Unnamed: 68
0 Aruba ABW Government expenditure on education, total (% ... SE.XPD.TOTL.GD.ZS NaN NaN NaN NaN NaN NaN ... 5.887830 5.49106 NaN NaN NaN NaN NaN NaN NaN NaN
1 Africa Eastern and Southern AFE Government expenditure on education, total (% ... SE.XPD.TOTL.GD.ZS NaN NaN NaN NaN NaN NaN ... 4.750985 4.88207 4.820445 4.73975 4.511475 4.35244 4.67441 4.24412 NaN NaN
2 Afghanistan AFG Government expenditure on education, total (% ... SE.XPD.TOTL.GD.ZS NaN NaN NaN NaN NaN NaN ... 3.255800 4.54397 4.343190 NaN NaN NaN NaN NaN NaN NaN
3 Africa Western and Central AFW Government expenditure on education, total (% ... SE.XPD.TOTL.GD.ZS NaN NaN NaN NaN NaN NaN ... 3.195490 2.78355 3.535590 3.07200 3.100000 3.40400 3.28288 2.92850 NaN NaN
4 Angola AGO Government expenditure on education, total (% ... SE.XPD.TOTL.GD.ZS NaN NaN NaN NaN NaN NaN ... 3.100000 2.75500 2.467000 2.04500 2.073000 2.74500 2.29700 2.33200 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Kosovo XKX Government expenditure on education, total (% ... SE.XPD.TOTL.GD.ZS NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
262 Yemen, Rep. YEM Government expenditure on education, total (% ... SE.XPD.TOTL.GD.ZS NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
263 South Africa ZAF Government expenditure on education, total (% ... SE.XPD.TOTL.GD.ZS NaN NaN NaN NaN NaN NaN ... 5.482850 5.44424 5.598670 5.64401 5.927700 6.18336 6.56206 6.18348 6.604 NaN
264 Zambia ZMB Government expenditure on education, total (% ... SE.XPD.TOTL.GD.ZS NaN NaN NaN NaN NaN NaN ... 4.624330 3.74792 3.729640 4.73975 4.418000 3.95000 3.10600 3.58300 NaN NaN
265 Zimbabwe ZWE Government expenditure on education, total (% ... SE.XPD.TOTL.GD.ZS NaN NaN NaN NaN NaN NaN ... 5.813000 5.47300 5.818780 2.05049 NaN NaN NaN NaN NaN NaN

266 rows × 69 columns

Dataset 13¶

The below file contains all of the information related to global gdp growth (annual %)

In [19]:
df12=pd.read_csv('./data/API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2_59.csv', skiprows=4)
df12
Out[19]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 Unnamed: 68
0 Aruba ABW GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN NaN NaN NaN NaN NaN ... -0.623626 1.719625 7.048533 2.381730 -2.302836 -23.982581 27.639357 10.458317 NaN NaN
1 Africa Eastern and Southern AFE GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN 0.460106 7.868013 5.616400 4.668135 5.138990 ... 3.011572 2.183581 2.651400 2.681718 2.189259 -2.780379 4.497380 3.554422 2.622310 NaN
2 Afghanistan AFG GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN NaN NaN NaN NaN NaN ... 1.451315 2.260314 2.647003 1.189228 3.911603 -2.351101 -20.738839 -6.240172 NaN NaN
3 Africa Western and Central AFW GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN 1.873455 3.707643 7.145784 5.406403 4.102491 ... 2.813979 0.141489 2.284961 2.842645 3.222029 -1.004417 3.994854 3.784486 3.342776 NaN
4 Angola AGO GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN NaN NaN NaN NaN NaN ... 0.943572 -2.580111 -0.147150 -1.316362 -0.702273 -5.638215 1.199211 3.045403 0.856001 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Kosovo XKX GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN NaN NaN NaN NaN NaN ... 5.916231 5.571775 4.825656 3.406632 4.756801 -5.340275 10.745656 4.278499 3.337895 NaN
262 Yemen, Rep. YEM GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN NaN NaN NaN NaN NaN ... -27.994546 -9.375124 -5.071796 0.752448 NaN NaN NaN NaN NaN NaN
263 South Africa ZAF GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN 3.844734 6.177931 7.373709 7.939609 6.122798 ... 1.321862 0.664552 1.157947 1.556784 0.259936 -5.963358 4.703062 1.910406 0.601662 NaN
264 Zambia ZMB GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN 1.361382 -2.490839 3.272393 12.214048 16.647456 ... 2.920375 3.755100 3.525863 4.034494 1.441306 -2.785055 6.234922 5.249622 5.832474 NaN
265 Zimbabwe ZWE GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN 6.316157 1.434471 6.244345 -1.106172 4.910571 ... 2.023650 0.900955 4.080264 5.009867 -6.332446 -7.816951 8.468017 6.522375 4.955782 NaN

266 rows × 69 columns

Dataset 14¶

This dataset covers info about Global gdp in Current US$

In [20]:
df13=pd.read_csv('./data/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_31795.csv', skiprows=4)
df13
Out[20]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 Unnamed: 68
0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.962907e+09 2.983635e+09 3.092429e+09 3.276184e+09 3.395799e+09 2.558906e+09 3.103184e+09 3.544708e+09 NaN NaN
1 Africa Eastern and Southern AFE GDP (current US$) NY.GDP.MKTP.CD 2.121696e+10 2.230747e+10 2.370247e+10 2.577938e+10 2.804954e+10 3.037491e+10 ... 8.992957e+11 8.298300e+11 9.401055e+11 1.012719e+12 1.006527e+12 9.290741e+11 1.086772e+12 1.183962e+12 1.236163e+12 NaN
2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.913422e+10 1.811657e+10 1.875346e+10 1.805322e+10 1.879944e+10 1.995593e+10 1.426650e+10 1.450216e+10 NaN NaN
3 Africa Western and Central AFW GDP (current US$) NY.GDP.MKTP.CD 1.188413e+10 1.268566e+10 1.360683e+10 1.443998e+10 1.576911e+10 1.693448e+10 ... 7.693673e+11 6.921811e+11 6.857502e+11 7.681896e+11 8.239336e+11 7.871467e+11 8.459930e+11 8.771408e+11 7.965862e+11 NaN
4 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 9.049642e+10 5.276162e+10 7.369015e+10 7.945069e+10 7.089796e+10 4.850156e+10 6.650513e+10 1.043997e+11 8.472296e+10 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 6.295848e+09 6.682677e+09 7.180765e+09 7.878760e+09 7.899738e+09 7.717145e+09 9.413404e+09 9.354903e+09 1.043835e+10 NaN
262 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 4.244449e+10 3.131783e+10 2.684223e+10 2.160616e+10 NaN NaN NaN NaN NaN NaN
263 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 8.748597e+09 9.225996e+09 9.813996e+09 1.085420e+10 1.195600e+10 1.306899e+10 ... 3.467098e+11 3.235855e+11 3.814488e+11 4.052607e+11 3.893300e+11 3.382914e+11 4.201178e+11 4.052709e+11 3.777816e+11 NaN
264 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 ... 2.125122e+10 2.095841e+10 2.587360e+10 2.631151e+10 2.330867e+10 1.813776e+10 2.209642e+10 2.916378e+10 2.816263e+10 NaN
265 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 ... 1.996312e+10 2.054868e+10 1.758489e+10 3.415607e+10 2.183223e+10 2.150970e+10 2.837124e+10 2.736663e+10 2.653827e+10 NaN

266 rows × 69 columns

Dataset 15¶

This dataset covers info about Population

In [21]:
df14=pd.read_csv('./data/API_SP.POP.TOTL_DS2_en_csv_v2_31753.csv', skiprows=4)
df14
Out[21]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 Unnamed: 68
0 Aruba ABW Population, total SP.POP.TOTL 54608.0 55811.0 56682.0 57475.0 58178.0 58782.0 ... 104257.0 104874.0 105439.0 105962.0 106442.0 106585.0 106537.0 106445.0 106277.0 NaN
1 Africa Eastern and Southern AFE Population, total SP.POP.TOTL 130692579.0 134169237.0 137835590.0 141630546.0 145605995.0 149742351.0 ... 600008424.0 616377605.0 632746570.0 649757148.0 667242986.0 685112979.0 702977106.0 720859132.0 739108306.0 NaN
2 Afghanistan AFG Population, total SP.POP.TOTL 8622466.0 8790140.0 8969047.0 9157465.0 9355514.0 9565147.0 ... 33753499.0 34636207.0 35643418.0 36686784.0 37769499.0 38972230.0 40099462.0 41128771.0 42239854.0 NaN
3 Africa Western and Central AFW Population, total SP.POP.TOTL 97256290.0 99314028.0 101445032.0 103667517.0 105959979.0 108336203.0 ... 408690375.0 419778384.0 431138704.0 442646825.0 454306063.0 466189102.0 478185907.0 490330870.0 502789511.0 NaN
4 Angola AGO Population, total SP.POP.TOTL 5357195.0 5441333.0 5521400.0 5599827.0 5673199.0 5736582.0 ... 28127721.0 29154746.0 30208628.0 31273533.0 32353588.0 33428486.0 34503774.0 35588987.0 36684202.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Kosovo XKX Population, total SP.POP.TOTL 990150.0 1014211.0 1038618.0 1063175.0 1087700.0 1111812.0 ... 1788196.0 1777557.0 1791003.0 1797085.0 1788878.0 1790133.0 1786038.0 1768086.0 1756374.0 NaN
262 Yemen, Rep. YEM Population, total SP.POP.TOTL 5542459.0 5646668.0 5753386.0 5860197.0 5973803.0 6097298.0 ... 28516545.0 29274002.0 30034389.0 30790513.0 31546691.0 32284046.0 32981641.0 33696614.0 34449825.0 NaN
263 South Africa ZAF Population, total SP.POP.TOTL 16520441.0 16989464.0 17503133.0 18042215.0 18603097.0 19187194.0 ... 55876504.0 56422274.0 56641209.0 57339635.0 58087055.0 58801927.0 59392255.0 59893885.0 60414495.0 NaN
264 Zambia ZMB Population, total SP.POP.TOTL 3119430.0 3219451.0 3323427.0 3431381.0 3542764.0 3658024.0 ... 16248230.0 16767761.0 17298054.0 17835893.0 18380477.0 18927715.0 19473125.0 20017675.0 20569737.0 NaN
265 Zimbabwe ZWE Population, total SP.POP.TOTL 3806310.0 3925952.0 4049778.0 4177931.0 4310332.0 4447149.0 ... 14154937.0 14452704.0 14751101.0 15052184.0 15354608.0 15669666.0 15993524.0 16320537.0 16665409.0 NaN

266 rows × 69 columns

Dataset 16¶

This dataset covers Global Gini Indexes.

In [22]:
df16 = pd.read_csv('./data/API_SI.POV.GINI_DS2_en_csv_v2_31732.csv', skiprows=4)
df16
Out[22]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 Unnamed: 68
0 Aruba ABW Gini index SI.POV.GINI NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Africa Eastern and Southern AFE Gini index SI.POV.GINI NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Afghanistan AFG Gini index SI.POV.GINI NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Africa Western and Central AFW Gini index SI.POV.GINI NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Angola AGO Gini index SI.POV.GINI NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 51.3 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Kosovo XKX Gini index SI.POV.GINI NaN NaN NaN NaN NaN NaN ... 26.5 26.7 29.0 NaN NaN NaN NaN NaN NaN NaN
262 Yemen, Rep. YEM Gini index SI.POV.GINI NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
263 South Africa ZAF Gini index SI.POV.GINI NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
264 Zambia ZMB Gini index SI.POV.GINI NaN NaN NaN NaN NaN NaN ... 55.8 NaN NaN NaN NaN NaN NaN 51.5 NaN NaN
265 Zimbabwe ZWE Gini index SI.POV.GINI NaN NaN NaN NaN NaN NaN ... NaN NaN 44.3 NaN 50.3 NaN NaN NaN NaN NaN

266 rows × 69 columns

Dataset 17¶

This dataset covers Global Income share held by highest 10%.

In [23]:
df17 = pd.read_csv('./data/API_SI.DST.10TH.10_DS2_en_csv_v2_12274.csv', skiprows=4)
df17
Out[23]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 Unnamed: 68
0 Aruba ABW Income share held by highest 10% SI.DST.10TH.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Africa Eastern and Southern AFE Income share held by highest 10% SI.DST.10TH.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Afghanistan AFG Income share held by highest 10% SI.DST.10TH.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Africa Western and Central AFW Income share held by highest 10% SI.DST.10TH.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Angola AGO Income share held by highest 10% SI.DST.10TH.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 39.6 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Kosovo XKX Income share held by highest 10% SI.DST.10TH.10 NaN NaN NaN NaN NaN NaN ... 21.5 21.8 24.6 NaN NaN NaN NaN NaN NaN NaN
262 Yemen, Rep. YEM Income share held by highest 10% SI.DST.10TH.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
263 South Africa ZAF Income share held by highest 10% SI.DST.10TH.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
264 Zambia ZMB Income share held by highest 10% SI.DST.10TH.10 NaN NaN NaN NaN NaN NaN ... 43.4 NaN NaN NaN NaN NaN NaN 39.1 NaN NaN
265 Zimbabwe ZWE Income share held by highest 10% SI.DST.10TH.10 NaN NaN NaN NaN NaN NaN ... NaN NaN 34.8 NaN NaN NaN NaN NaN NaN NaN

266 rows × 69 columns

Dataset 18¶

This dataset covers Income share held by highest 20%

In [24]:
df18 = pd.read_csv('./data/API_SI.DST.05TH.20_DS2_en_csv_v2_9120.csv', skiprows=4)
df18
Out[24]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 Unnamed: 68
0 Aruba ABW Income share held by highest 20% SI.DST.05TH.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Africa Eastern and Southern AFE Income share held by highest 20% SI.DST.05TH.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Afghanistan AFG Income share held by highest 20% SI.DST.05TH.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Africa Western and Central AFW Income share held by highest 20% SI.DST.05TH.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Angola AGO Income share held by highest 20% SI.DST.05TH.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 55.6 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Kosovo XKX Income share held by highest 20% SI.DST.05TH.20 NaN NaN NaN NaN NaN NaN ... 36.1 36.3 38.6 NaN NaN NaN NaN NaN NaN NaN
262 Yemen, Rep. YEM Income share held by highest 20% SI.DST.05TH.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
263 South Africa ZAF Income share held by highest 20% SI.DST.05TH.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
264 Zambia ZMB Income share held by highest 20% SI.DST.05TH.20 NaN NaN NaN NaN NaN NaN ... 60.2 NaN NaN NaN NaN NaN NaN 56.4 NaN NaN
265 Zimbabwe ZWE Income share held by highest 20% SI.DST.05TH.20 NaN NaN NaN NaN NaN NaN ... NaN NaN 51.1 NaN NaN NaN NaN NaN NaN NaN

266 rows × 69 columns

Dataset 19¶

This code covers Income share held by lowest 20%

In [25]:
df19 = pd.read_csv('./data/API_SI.DST.FRST.20_DS2_en_csv_v2_3154.csv', skiprows=4)
df19
Out[25]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 Unnamed: 68
0 Aruba ABW Income share held by lowest 20% SI.DST.FRST.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Africa Eastern and Southern AFE Income share held by lowest 20% SI.DST.FRST.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Afghanistan AFG Income share held by lowest 20% SI.DST.FRST.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Africa Western and Central AFW Income share held by lowest 20% SI.DST.FRST.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Angola AGO Income share held by lowest 20% SI.DST.FRST.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 3.8 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Kosovo XKX Income share held by lowest 20% SI.DST.FRST.20 NaN NaN NaN NaN NaN NaN ... 9.5 9.4 9.2 NaN NaN NaN NaN NaN NaN NaN
262 Yemen, Rep. YEM Income share held by lowest 20% SI.DST.FRST.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
263 South Africa ZAF Income share held by lowest 20% SI.DST.FRST.20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
264 Zambia ZMB Income share held by lowest 20% SI.DST.FRST.20 NaN NaN NaN NaN NaN NaN ... 3.1 NaN NaN NaN NaN NaN NaN 3.9 NaN NaN
265 Zimbabwe ZWE Income share held by lowest 20% SI.DST.FRST.20 NaN NaN NaN NaN NaN NaN ... NaN NaN 6.0 NaN NaN NaN NaN NaN NaN NaN

266 rows × 69 columns

Dataset 20¶

This dataset covers Income share held by lowest 10%

In [26]:
df20 = pd.read_csv('./data/API_SI.DST.FRST.10_DS2_en_csv_v2_12272.csv', skiprows=4)
df20
Out[26]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 Unnamed: 68
0 Aruba ABW Income share held by lowest 10% SI.DST.FRST.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Africa Eastern and Southern AFE Income share held by lowest 10% SI.DST.FRST.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Afghanistan AFG Income share held by lowest 10% SI.DST.FRST.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Africa Western and Central AFW Income share held by lowest 10% SI.DST.FRST.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Angola AGO Income share held by lowest 10% SI.DST.FRST.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 1.3 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Kosovo XKX Income share held by lowest 10% SI.DST.FRST.10 NaN NaN NaN NaN NaN NaN ... 4.0 3.8 3.8 NaN NaN NaN NaN NaN NaN NaN
262 Yemen, Rep. YEM Income share held by lowest 10% SI.DST.FRST.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
263 South Africa ZAF Income share held by lowest 10% SI.DST.FRST.10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
264 Zambia ZMB Income share held by lowest 10% SI.DST.FRST.10 NaN NaN NaN NaN NaN NaN ... 1.2 NaN NaN NaN NaN NaN NaN 1.5 NaN NaN
265 Zimbabwe ZWE Income share held by lowest 10% SI.DST.FRST.10 NaN NaN NaN NaN NaN NaN ... NaN NaN 2.5 NaN NaN NaN NaN NaN NaN NaN

266 rows × 69 columns

Total Records¶

Please summarize the total number of records (rows and columns) per dataset used in the project. This will give a general overview of the scope of your project.
  • https://www.iban.com/country-codes 249 rows × 4 columns
  • M850011.csv, 63 rows × 12 columns
  • M850581.csv, 34 rows × 180 columns
  • M850671.csv, 15 rows × 7 columns
  • M850691.csv, 7 rows × 18 columns
  • M183401.csv, 32 rows × 19 columns
  • M015651.csv, 198 rows × 29 columns
  • 17892.csv, 25 rows × 2 columns
  • DEMO_DS_23082024021945252.csv, 27114 rows × 9 columns
  • NATMON_DS_23082024022406407.csv, 3781 rows × 9 columns
  • DataForTable2 (1).csv, 2363 rows × 11 columns
  • API_SE.XPD.TOTL.GD.ZS_DS2_en_csv_v2_68.csv, 266 rows × 69 columns
  • API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2_59.csv, 266 rows × 69 columns
  • API_NY.GDP.MKTP.CD_DS2_en_csv_v2_31795.csv, 266 rows × 69 columns
  • API_SP.POP.TOTL_DS2_en_csv_v2_31753.csv, 266 rows × 69 columns
  • API_SI.POV.GINI_DS2_en_csv_v2_31732.csv, 266 rows × 69 columns
  • API_SI.DST.10TH.10_DS2_en_csv_v2_12274.csv, 266 rows × 69 columns
  • API_SI.DST.05TH.20_DS2_en_csv_v2_9120.csv, 266 rows × 69 columns
  • API_SI.DST.FRST.20_DS2_en_csv_v2_3154.csv, 266 rows × 69 columns
  • API_SI.DST.FRST.10_DS2_en_csv_v2_12272.csv, 266 rows × 69 columns

Data Cleaning & EDA¶

  • Do not manually perform any data cleaning steps — all data cleaning should be done programmatically, by Python code that you write. That is, you should not be searching in csv files in excel and filling in empty boxes or changing strings to numbers etc.

  • For data cleaning, be clear in which dataset (or variables) are used, what has been done for missing data, how was merging performed, explanation of data transformation (if any).

  • If data is calculated or summarized from the raw dataset, explain the rationale and steps clearly.

For each research questions shortlisted, outline your methodology in answering them. Discuss interesting observations or results discovered.

Please note to only show EDA that's relevant to answering the question at hand. If you have done any data modeling, include in this section.

Question 1¶

Overview¶

In this section, we answer the question, What is the correlation between government expenditure on education and GDP growth?

Firstly, we will select the appropriate data needed to answer the question. In Exploratory Analysis, we attempt to gain insight into the question by analysing appropriate graphs and figures. This will help us gain a better understanding of the question, as well as the relevant data, before we plunge into Quantitative Analysis.

In Quantitative Analysis, we tackle the question head-on, and gain real figures, such as correlations, etc. that will help us answer the question.

Cleaning¶

We will be using the following 6 datasets for this question.

  1. https://tablebuilder.singstat.gov.sg/table/TS/M850011#

    • (SG) Government Expenditure On Education, from 1960 - 2022
  2. https://tablebuilder.singstat.gov.sg/table/TS/M015651

    • (SG) Gross Domestic Product At Current Prices, By Industry (SSIC 2020), from 1975 - 2024
  3. https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?_gl=1

    • (Global), Government expenditure on education, total (% of GDP)
  4. https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?_gl=1

    • (Global), GDP growth (annual %)
  5. https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

    • (Global), GDP (Current US$)
  6. https://data.worldbank.org/indicator/SP.POP.TOTL?name_desc=false

    • (Global), Total Population

Firstly, we will clean the datasets. To maintain uniformity, we will convert everything to USD, where applicable

In [27]:
# Extracting the 'Total Government Expenditure On Education' column and converting values to float
# Multiply by 1000 to convert from thousand dollars, and by 0.78 to adjust for a certain factor (e.g., inflation or scaling).
sg_edu_expense = df1[["Total Government Expenditure On Education (Thousand Dollars)"]].astype(float) * 1000 * 0.78

# Renaming the column to 'Expenditure On Education' for clarity
sg_edu_expense.rename(columns={"Total Government Expenditure On Education (Thousand Dollars)": "Expenditure On Education"}, inplace=True)

# Converting the index to a datetime format, assuming the index represents years and removing any whitespace
sg_edu_expense.index = pd.to_datetime(sg_edu_expense.index.str.strip(), format='%Y')

# Displaying the first 5 rows of the modified DataFrame
display(sg_edu_expense.head(5))

# Providing information about the DataFrame's structure and data types
sg_edu_expense.info()
Data Series Expenditure On Education
2022-01-01 1.018721e+10
2021-01-01 1.006973e+10
2020-01-01 9.562685e+09
2019-01-01 9.924155e+09
2018-01-01 1.004327e+10
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 63 entries, 2022-01-01 to 1960-01-01
Data columns (total 1 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Expenditure On Education  63 non-null     float64
dtypes: float64(1)
memory usage: 1008.0 bytes
In [28]:
# Extracting the year part from the index, converting it to float for time calculations
years = df6.index.str[:4].astype(float)

# Extracting the quarter part from the index and mapping it to decimal values (1Q = 0.25, 2Q = 0.50, etc.)
quarters = df6.index.str[5:7].map({'1Q': 0.25, '2Q': 0.50, '3Q': 0.75, '4Q': 1.00}).astype(float)

# Combining years and quarters to get a continuous time variable
time = years + quarters

# Selecting the 'GDP At Current Market Prices' column, converting values to float, and renaming the column to 'GDP'
sg_gdp = df6[['GDP At Current Market Prices']].astype(float).rename(columns={"GDP At Current Market Prices": "GDP"})

# Setting the new 'time' variable as the index for the GDP DataFrame
sg_gdp.index = time

# Scaling GDP values (converting to millions and adjusting by 0.78, perhaps for inflation or other adjustment)
sg_gdp *= 1e6 * 0.78

# Adjusting GDP values by adding the shifted values (lagged by 1, 2, and 3 periods) to account for past GDP values
sg_gdp['GDP'] += sg_gdp['GDP'].shift(1) + sg_gdp['GDP'].shift(2) + sg_gdp['GDP'].shift(3)

# Filtering the DataFrame to include only the rows where the index (time) represents a whole year (e.g., 2000.00, 2001.00)
sg_gdp = sg_gdp[sg_gdp.index % 1 == 0]

# Dropping rows with any missing values (NaNs) to ensure data cleanliness
sg_gdp.dropna(axis=0, inplace=True)

# Converting the index (year) back to datetime format for easier time-based plotting or analysis
sg_gdp.index = pd.to_datetime(sg_gdp.index.astype(int).astype(str))

# Displaying the first 5 rows of the cleaned GDP DataFrame
display(sg_gdp.head(5))

# Displaying information about the structure of the GDP DataFrame
sg_gdp.info()
Data Series GDP
2023-01-01 5.217365e+11
2022-01-01 5.244607e+11
2021-01-01 4.320549e+11
2020-01-01 3.789621e+11
2019-01-01 4.028781e+11
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48 entries, 2023-01-01 to 1976-01-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   GDP     48 non-null     float64
dtypes: float64(1)
memory usage: 768.0 bytes

The above 2 dataframes have no null values. Therefore, we do not need to fill in any missing values. Let us proceed.

This converts a DataFrame of country codes and names into a dictionary where the country codes map to their corresponding country names.

In [29]:
# Selecting the first two columns from df11, assuming the first column contains country codes and the second contains country names
ISOcode_to_name = df11.iloc[:, :2]

# Setting the 'Country Code' column as the index of the DataFrame
ISOcode_to_name.set_index('Country Code', inplace=True)

# Removing the index name ('Country Code') for a cleaner DataFrame
ISOcode_to_name.index.name = None

# Displaying the first few rows of the DataFrame to check the transformation
display(ISOcode_to_name)

# Converting the DataFrame into a dictionary where the index (Country Code) is the key and 'Country Name' is the value
ISOcode_to_name = ISOcode_to_name.to_dict()

# Extracting the 'Country Name' dictionary from the above transformation
ISOcode_to_name = ISOcode_to_name['Country Name']
Country Name
ABW Aruba
AFE Africa Eastern and Southern
AFG Afghanistan
AFW Africa Western and Central
AGO Angola
... ...
XKX Kosovo
YEM Yemen, Rep.
ZAF South Africa
ZMB Zambia
ZWE Zimbabwe

266 rows × 1 columns

This code cleans the global datasets by removing unnecessary columns, filtering rows, transposing the data for better usability, and finally displaying the first few rows of each cleaned DataFrame.

In [30]:
# Making copies of the original DataFrames to avoid modifying the original data
global_edu_expense_percent_gdp  = df11.copy()
global_gdp_growth_percent = df12.copy()
global_gdp = df13.copy()
global_pop = df14.copy()

# Function to clean the DataFrames by removing unnecessary columns, setting index, and filtering rows
def clean_df(df):
    # Dropping irrelevant columns ('Country Name', 'Indicator Name', 'Indicator Code', 'Unnamed: 68')
    df.drop(columns=['Country Name', 'Indicator Name', 'Indicator Code', 'Unnamed: 68'], inplace=True)
    
    # Setting the 'Country Code' column as the index
    df.set_index('Country Code', inplace=True)
    
    # Removing the index name for clarity
    df.index.name = None
    
    # Dropping rows where the country code is not present in 'country_codes_dict'
    df.drop(df[~df.index.isin(country_codes_dict)].index, inplace=True)
    
    # Transposing the DataFrame to have years as rows and country codes as columns
    df = df.T
    
    # Converting the year index to datetime format
    df.index = pd.to_datetime(df.index.astype(int).astype(str))
    
    # Returning the cleaned DataFrame
    return df

# Cleaning all global datasets using the clean_df function
global_edu_expense_percent_gdp = clean_df(global_edu_expense_percent_gdp)
global_gdp_growth_percent = clean_df(global_gdp_growth_percent)
global_gdp = clean_df(global_gdp)
global_pop = clean_df(global_pop)

# Displaying the first three rows of the cleaned Global Educational Expenditure as a Percent of GDP DataFrame
print("\n\033[1mGlobal Educational Expenditure as a Percent of GDP\033[0m")
display(global_edu_expense_percent_gdp.head(3))

# Displaying the first three rows of the cleaned Global GDP Growth (annual %) DataFrame
print("\033[1mGlobal GDP Growth (annual %)\033[0m")
display(global_gdp_growth_percent.head(3))

# Displaying the first three rows of the cleaned Global GDP (USD) DataFrame
print("\033[1mGlobal GDP (USD)\033[0m")
display(global_gdp.head(3))

# Displaying the first three rows of the cleaned Global Population DataFrame
print("\033[1mGlobal Population\033[0m")
display(global_pop.head(3))
Global Educational Expenditure as a Percent of GDP
ABW AFG AGO ALB AND ARE ARG ARM ASM ATG ... VEN VGB VIR VNM VUT WSM YEM ZAF ZMB ZWE
1960-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1961-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1962-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 215 columns

Global GDP Growth (annual %)
ABW AFG AGO ALB AND ARE ARG ARM ASM ATG ... VEN VGB VIR VNM VUT WSM YEM ZAF ZMB ZWE
1960-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1961-01-01 NaN NaN NaN NaN NaN NaN 5.427843 NaN NaN NaN ... 3.192519 NaN NaN NaN NaN NaN NaN 3.844734 1.361382 6.316157
1962-01-01 NaN NaN NaN NaN NaN NaN -0.852022 NaN NaN NaN ... 8.532934 NaN NaN NaN NaN NaN NaN 6.177931 -2.490839 1.434471

3 rows × 215 columns

Global GDP (USD)
ABW AFG AGO ALB AND ARE ARG ARM ASM ATG ... VEN VGB VIR VNM VUT WSM YEM ZAF ZMB ZWE
1960-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 7.663938e+09 NaN NaN NaN NaN NaN NaN 8.748597e+09 7.130000e+08 1.052990e+09
1961-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 8.067267e+09 NaN NaN NaN NaN NaN NaN 9.225996e+09 6.962857e+08 1.096647e+09
1962-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 8.814310e+09 NaN NaN NaN NaN NaN NaN 9.813996e+09 6.931429e+08 1.117602e+09

3 rows × 215 columns

Global Population
ABW AFG AGO ALB AND ARE ARG ARM ASM ATG ... VEN VGB VIR VNM VUT WSM YEM ZAF ZMB ZWE
1960-01-01 54608.0 8622466.0 5357195.0 1608800.0 9443.0 133426.0 20349744.0 1904148.0 20085.0 55342.0 ... 8156937.0 7850.0 32500.0 32718461.0 64608.0 113335.0 5542459.0 16520441.0 3119430.0 3806310.0
1961-01-01 55811.0 8790140.0 5441333.0 1659800.0 10216.0 140984.0 20680653.0 1971530.0 20626.0 56245.0 ... 8453106.0 7885.0 34300.0 33621982.0 66462.0 116820.0 5646668.0 16989464.0 3219451.0 3925952.0
1962-01-01 56682.0 8969047.0 5521400.0 1711319.0 11014.0 148877.0 21020359.0 2039346.0 21272.0 57008.0 ... 8754082.0 7902.0 35000.0 34533889.0 68391.0 120163.0 5753386.0 17503133.0 3323427.0 4049778.0

3 rows × 215 columns

This code calculates Global GDP per Capita by dividing the GDP values by the corresponding population values, and then displays the first 5 rows of the result.

In [31]:
# Calculating Global GDP per Capita by dividing global GDP by global population
global_gdp_per_capita = global_gdp / global_pop

# Printing and displaying the first 5 rows of the calculated Global GDP per Capita
print("\n\033[1mGlobal GDP per Capita\033[0m")
display(global_gdp_per_capita.head(5))
Global GDP per Capita
ABW AFG AGO ALB AND ARE ARG ARM ASM ATG ... VEN VGB VIR VNM VUT WSM YEM ZAF ZMB ZWE
1960-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 939.560806 NaN NaN NaN NaN NaN NaN 529.561923 228.567399 276.643363
1961-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 954.355361 NaN NaN NaN NaN NaN NaN 543.042224 216.274674 279.332656
1962-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1006.879977 NaN NaN NaN NaN NaN NaN 560.699394 208.562685 275.966139
1963-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1060.570324 NaN NaN NaN NaN NaN NaN 601.599951 209.453362 277.532515
1964-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 874.199411 NaN NaN NaN NaN NaN NaN 642.688431 236.941713 282.376856

5 rows × 215 columns

This code reshapes, merges, and processes data to calculate relevant metrics (such as education expenditure and expenditure per capita) and cleans up the final dataset for further analysis.

In [32]:
# Reshaping the global population data into long format (year, country_code, population)
temp_pop = global_pop.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "country_code", "value": "population"})

# Reshaping the global GDP data into long format (year, country_code, GDP)
temp_gdp = global_gdp.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "country_code", "value": "GDP"})

# Reshaping the global GDP per capita data into long format (year, country_code, GDP_per_capita)
temp_gdp_per_capita = global_gdp_per_capita.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "country_code", "value": "GDP_per_capita"})

# Reshaping the global education expenditure as a percent of GDP data into long format (year, country_code, edu_expense_per_GDP)
temp_edu_expense_percent_gdp = global_edu_expense_percent_gdp.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "country_code", "value": "edu_expense_per_GDP"})

# Merging all datasets into one (year, country_code), adding population, GDP, GDP_per_capita, and edu_expense_per_GDP for each combination
q1_df = temp_pop.merge(temp_gdp, on=['year', 'country_code'])\
                .merge(temp_gdp_per_capita, on=['year', 'country_code'])\
                .merge(temp_edu_expense_percent_gdp, on=['year', 'country_code'])

# Calculating the total education expenditure based on the percentage of GDP
q1_df['edu_expense'] = q1_df['edu_expense_per_GDP'] * q1_df['GDP'] / 100

# Calculating the education expenditure per capita by dividing the total expenditure by the population
q1_df['edu_expense_per_capita'] = q1_df['edu_expense'] / q1_df['population']

# Mapping country codes to country names using a dictionary
q1_df['country_name'] = q1_df['country_code'].map(country_codes_dict)

# Creating a raw backup of the original DataFrame before dropping missing values
q1_df_raw = q1_df

# Dropping rows with missing values and resetting the index for the cleaned dataset
q1_df = q1_df.dropna().reset_index(drop=True)

# Display the cleaned dataset
q1_df
Out[32]:
year country_code population GDP GDP_per_capita edu_expense_per_GDP edu_expense edu_expense_per_capita country_name
0 1998-01-01 ABW 84355.0 1.665363e+09 19742.316739 4.76316 7.932391e+07 940.358179 Aruba
1 1999-01-01 ABW 86867.0 1.722905e+09 19833.826746 4.39527 7.572633e+07 871.750211 Aruba
2 2000-01-01 ABW 89101.0 1.873453e+09 21026.167091 4.71536 8.834003e+07 991.459507 Aruba
3 2001-01-01 ABW 90691.0 1.896457e+09 20911.192767 4.79850 9.100149e+07 1003.423598 Aruba
4 2002-01-01 ABW 91781.0 1.961844e+09 21375.269123 4.87178 9.557670e+07 1041.356069 Aruba
... ... ... ... ... ... ... ... ... ...
4858 2014-01-01 ZWE 13855753.0 1.949552e+10 1407.034291 6.13835 1.196703e+09 86.368690 Zimbabwe
4859 2015-01-01 ZWE 14154937.0 1.996312e+10 1410.329173 5.81300 1.160456e+09 81.982438 Zimbabwe
4860 2016-01-01 ZWE 14452704.0 2.054868e+10 1421.787791 5.47300 1.124629e+09 77.814447 Zimbabwe
4861 2017-01-01 ZWE 14751101.0 1.758489e+10 1192.107012 5.81878 1.023226e+09 69.366084 Zimbabwe
4862 2018-01-01 ZWE 15052184.0 3.415607e+10 2269.177012 2.05049 7.003668e+08 46.529246 Zimbabwe

4863 rows × 9 columns

Exploratory Analysis¶

Let's look at a graph displaying the Singapore data relevant to this question

In [33]:
# Create a figure and the first subplot (ax1) with a specific figure size
fig, ax1 = plt.subplots(figsize=(10, 6))

# Plot Singapore education expenditure over time (in billion USD) on ax1 using a blue line
ax1.plot(sg_edu_expense.index, sg_edu_expense / 1e9, color='blue')

# Set the label for the y-axis on ax1, representing education expenditure, with the label in blue color
ax1.set_ylabel("Education Expenditure (Billion USD)", color='blue')

# Change the color of the y-axis tick labels on ax1 to blue
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis (ax2) that shares the same x-axis as ax1 for plotting GDP
ax2 = ax1.twinx()

# Plot Singapore GDP over time (in billion USD) on ax2 using a green line
ax2.plot(sg_gdp.index, sg_gdp / 1e9, color='green')

# Set the label for the y-axis on ax2, representing GDP, with the label in green color
ax2.set_ylabel("Singapore GDP (Billion USD)", color='green')

# Change the color of the y-axis tick labels on ax2 to green
ax2.tick_params(axis='y', labelcolor='green')

# Display the final plot with both lines (education expenditure and GDP)
plt.title('SGP GDP and Edu Expenditure over Time')
plt.show()
No description has been provided for this image

From the graph, it is quite clear that there is a strong correlation between Singapore GDP and Education Expenditure. Now, lets take a look at some global data.

In [34]:
# Create a new figure object for the subplots
fig = plt.figure()

# Add a subplot (ax0) on the left side (1 row, 2 columns, 1st subplot)
ax0 = fig.add_subplot(1, 2, 1)

# Add a subplot (ax1) on the right side (1 row, 2 columns, 2nd subplot)
ax1 = fig.add_subplot(1, 2, 2)

# Create a box plot of the average global education expenditure (% of GDP) over the last 20 years
# Use the color blue and make the plot horizontal, then place it in the first subplot (ax0)
global_edu_expense_percent_gdp.iloc[-20:, :].mean().plot(kind='box', color='blue', vert=False, figsize=(20, 8), ax=ax0)

# Set the title for the first subplot (ax0)
ax0.set_title('Box Plot of Global Education Expenditure (% of GDP)')

# Set the x-axis label for the first subplot (ax0)
ax0.set_xlabel('Education Expenditure (% of GDP)')

# Create a histogram of the average global education expenditure (% of GDP) over the last 20 years
# Place the histogram in the second subplot (ax1)
global_edu_expense_percent_gdp.iloc[-20:, :].mean().plot(kind='hist', figsize=(20, 8), ax=ax1)

# Set the title for the second subplot (ax1)
ax1.set_title('Distribution of Global Education Expenditure (% of GDP)')

# Set the y-axis label for the second subplot (ax1)
ax1.set_ylabel('Number of Countries')

# Set the x-axis label for the second subplot (ax1)
ax1.set_xlabel('Education Expenditure (% of GDP)')

# Set a main title for the entire figure
fig.suptitle('Global Education Expenditure Analysis', fontsize=16)

# Display the figure with both subplots
plt.show()
No description has been provided for this image

From the graphs, it is quite clear that the vast majority of countries spend between 3-5% of their GDP on Education. However, note that this also means that most countries GDP are 20-30 times their Education Expenditure. Now, we will explore the variations between the countries, and figure out which countries spend less on their Education (as a percent of their GDP) and which countries spend more.

The below code shows an animated chloropleth graph which helps to visualise the Education Expenditure as a percent of GDP over countries and time.

In [35]:
# Fill forward missing values in the global_edu_expense_percent_gdp dataframe to handle missing data
map_data = global_edu_expense_percent_gdp.ffill().reset_index()

# Reshape the data using melt to make it long-form for plotting
# 'index' becomes 'year', and country ISO codes will be the 'iso_code'
map_data = map_data.melt(id_vars='index').rename(columns={"variable": "iso_code", "index": "year"})

# Map the ISO country codes to their corresponding country names using a dictionary
map_data['country_name'] = map_data['iso_code'].map(country_codes_dict)

# Convert the 'year' column to only include the year (dropping month/day if present)
map_data.year = map_data.year.dt.year

# Create an animated choropleth map using Plotly Express
px.choropleth(
    map_data,                        # Input dataset
    locations="iso_code",            # The ISO codes to be used for location
    color="value",                   # Column to represent in color scale (education expenditure as % of GDP)
    hover_name="country_name",       # The name to display when hovering over a country
    animation_frame="year",          # Column for creating the year-by-year animation
    color_continuous_scale=px.colors.diverging.RdBu,  # Color scale for the choropleth
    height=600,                      # Height of the figure
    range_color=[3, 7],               # Set the range for the color scale (3% to 7% for education expenditure)
    title="Edu Expenditure (% GDP) over Time"
)

From the graph (from 2005 - 2023), we can see that most developed countries, such as USA, Australia, Sweden, Finland, France all have high education expenditure (blue) whereas most underdeveloped countries, such as countries in Africa and Asia have low education expenditure (brown). We will explore this correlation in the Quantitative Analysis section.

The graphs below show a treemap of GDP, Population GDP per Capita in 2023

In [36]:
treemap_data = global_gdp.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "iso_code"})
treemap_data = treemap_data[treemap_data.year=='2023']
treemap_data['country_name'] = treemap_data['iso_code'].map(country_codes_dict)
fig = px.treemap(
    treemap_data,
    path=['iso_code'],  # Path defines the hierarchy (here we only use Country)
    values='value',      # The size of the rectangles will be based on GDP
    color='value', # The color of the rectangles will be based on Population
    hover_data=['country_name', 'value'],  # Data to show on hover
    color_continuous_scale='Plasma', # Diverging color scale
    title="Treemap of GDP"
)
fig.show()

treemap_data = global_pop.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "iso_code"})
treemap_data = treemap_data[treemap_data.year=='2023']
treemap_data['country_name'] = treemap_data['iso_code'].map(country_codes_dict)
fig = px.treemap(
    treemap_data,
    path=['iso_code'],  # Path defines the hierarchy (here we only use Country)
    values='value',      # The size of the rectangles will be based on GDP
    color='value', # The color of the rectangles will be based on Population
    hover_data=['country_name', 'value'],  # Data to show on hover
    color_continuous_scale='Plasma', # Diverging color scale
    title="Treemap of Population"
)
fig.show()

treemap_data = global_gdp_per_capita.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": "iso_code"})
treemap_data = treemap_data[treemap_data.year=='2023']
treemap_data['country_name'] = treemap_data['iso_code'].map(country_codes_dict)
fig = px.treemap(
    treemap_data,
    path=['iso_code'],  # Path defines the hierarchy (here we only use Country)
    values='value',      # The size of the rectangles will be based on GDP
    color='value', # The color of the rectangles will be based on Population
    hover_data=['country_name', 'value'],  # Data to show on hover
    color_continuous_scale='Plasma', # Diverging color scale
    title="Treemap of GDP Per Capita"
)
fig.show()

We now have the choice of either finding correlatiion between education expenditure on GDP or GDP per Capita. We will go with GDP per Capita since this would not be inflated by the country's population. For example, Singapore has a relatively low gross GDP even though it has a high education expenditure (as a percent of GDP). This is mainly because of its low population. Therefore, factoring out the population will allow us to determine the true impact of education expenditure on GDP.

Now, we will do some quantitative analysis between education expenditure and GDP

Quantitative Analysis¶

In this section, we will proceed with quantitative analysis. In other words, we will justify (quantitatively) some of the claims in the previous section. The DataFrame below is what we will be using throughout this section.

In [37]:
plt.figure(figsize=(8, 6))
sns.heatmap(q1_df.iloc[:, 2:-1].corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Heatmap of Correlation Between Numerical Variables')
plt.show()
No description has been provided for this image

Right away, notice the strong positive correlations between Educational Expenditure and GDP as well as their per_capita counterparts. This clearly shows how Educational Expenditure is tied very strongly to GDP.

Now, put yourself in the government's shoes. What one really wants to know is how much of the GDP to allocate to Education, in order to reap future rewards in GDP (per capita). The relevant columns in the DataFrame above are edu_expense_per_GDP and GDP_per_capita. Although they show a correlation of 0.06, we will proceed to investigate this further. Note that this is very relevant to the overarching research question, What is the quantitative impact of government expenditure on education on GDP growth?

Firstly, let us now explore the impact of some countries with low population (less than 1 million)

In [38]:
q1_df[q1_df.population <= 1e6] \
    .groupby(['country_code', 'country_name'])[['population', 'GDP_per_capita']] \
    .mean() \
    .sort_values('GDP_per_capita', ascending=False) \
    .head(10)
Out[38]:
population GDP_per_capita
country_code country_name
MCO Monaco 33437.050000 136853.732749
LIE Liechtenstein 34988.166667 123122.192944
CYM Cayman Islands (the) 65509.000000 87539.048125
BMU Bermuda 62325.333333 77884.442188
LUX Luxembourg 456808.763158 63137.489784
SMR San Marino 33483.909091 49320.200374
ISL Iceland 310283.833333 47582.679899
MAC Macao 547816.125000 47115.717361
AND Andorra 74414.761905 41828.512092
BHS Bahamas (the) 392210.428571 29106.289630

Clearly, countries with very low population should not be part of our analysis. This is because such countries serve a very different purpose as compared to the rest. For example, Monaco, The Cayman Islands and The Bahamas are not exactly famous for their education. Therefore, in the interest of more conventional nations, such as Singapore, China, India, USA, we will exclude such countries from our analysis.

Secondly, let us investigate the effect of nations with low GDP per capita (less that $5000 USD)

In [39]:
q1_df[q1_df.GDP_per_capita <= 5000] \
    .groupby(['country_code', 'country_name'])[['GDP_per_capita']] \
    .mean() \
    .head(10)
Out[39]:
GDP_per_capita
country_code country_name
AFG Afghanistan 516.019290
AGO Angola 2353.842443
ALB Albania 2650.234690
ARG Argentina 3426.489600
ARM Armenia 2790.690035
AUT Austria 3177.055201
AZE Azerbaijan 2138.250456
BDI Burundi 200.538067
BEN Benin 954.686457
BFA Burkina Faso 550.390259

Most of such countries are classified as developing or underdeveloped. In such countries, there are many other factors at play in determining the GDP, due to their inherent instability. Therefore, the percentage of GDP they devote to education will not have a strong influence on the GDP growth. Since this analysis is more geared towards developed nations, we will also exclude such countries from our analysis.

Now, lets view a scatterplot of GDP_per_capita vs edu_expense_per_GDP, after excluding countries with low population (< 1 million) and low GDP_per_capita (< $5000 USD)

In [40]:
q1_df_developed = q1_df[(q1_df['population'] >= 1000000) & (q1_df['GDP_per_capita'] >= 5000)]
fig = sns.regplot(data=q1_df_developed, x='edu_expense_per_GDP', y='GDP_per_capita')
plt.xlabel('Edu Expense (% GDP)')
plt.ylabel('GDP per Capita')
plt.title('GDP per Capita vs Edu Expense (% GDP)')
plt.show()
No description has been provided for this image
In [41]:
#calculate correlation
pearson_coef, p_value = stats.pearsonr(q1_df_developed['edu_expense_per_GDP'], q1_df_developed['GDP_per_capita'])
print("Pearson Coefficient: ", pearson_coef)
print("P-Value: ", p_value)
Pearson Coefficient:  0.21328518427710413
P-Value:  5.305722840202821e-19

We now get that Education Expenditure as a percent of GDP is weakly related to GDP_per_capita (with high certainty).

In the last decade, there have been many advances in technology, and a country's GDP is getting ever more closely related to its population's literacy. There has been growing demand from many new sectors, and new technologies have been replacing otherwise mundane tasks. Therefore, to wrap up our discussion for this question, let us delve into the correlation between Education Expenditure as a percent of GDP and GDP_per_capita during the last decade.

In [42]:
mean_edu_expense = pd.DataFrame(global_edu_expense_percent_gdp.iloc[-10:, :].mean(axis=0), columns=['Mean Education Expenditure (% of GDP)'])
mean_gdp_per_capita = pd.DataFrame(global_gdp_per_capita.iloc[-10:, :].mean(axis=0), columns=['Mean GDP Per Capita'])
mean_pop = pd.DataFrame(global_pop.iloc[-10:, :].mean(axis=0), columns=['Mean Population'])
merged_df = pd.merge(mean_edu_expense, mean_gdp_per_capita, left_index=True, right_index=True).merge(mean_pop, right_index=True, left_index=True)

def pearson_coef(min_pop, min_gdp_per_capita):
    return merged_df[(merged_df['Mean Population'] >= min_pop) & (merged_df['Mean GDP Per Capita'] >= min_gdp_per_capita)].corr().iloc[0, 1]

pop_ranges = range(10000000, 1000000, -1000000)
gdp_ranges = range(5000, 10001, 1000)

correlation_matrix = pd.DataFrame(index=pop_ranges, columns=gdp_ranges)

for min_pop in pop_ranges:
    for min_gdp in gdp_ranges:
        correlation_matrix.loc[min_pop, min_gdp] = pearson_coef(min_pop, min_gdp)

correlation_matrix = correlation_matrix.apply(pd.to_numeric, errors='coerce')

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True,
            xticklabels=[int(gdp / 1000) for gdp in gdp_ranges], yticklabels=[int(pop / 1e6) for pop in pop_ranges])
plt.title('Correlation between Education Expenditure (% GDP) and GDP per Capita')
plt.xlabel('Min GDP per Capita Threshold (Thousand USD)')
plt.ylabel('Min Population Threshold (Millions)')
plt.show()
No description has been provided for this image

From the graph, we can see that as we increase the Min Population Threshold and the Min GDP per capita Threshold, the correlation between Education Expenditure as a percent of GDP and GDP_per_capita gets stronger and stronger. This finally brings us to the conclusion to this question.

Findings¶
  1. A country's GDP has a very strong correlation with its expenditure on education
  2. The percentage of GDP a country decides to invest in education has a correlation with its GDP per capita. As a country gets bigger and more developed, this correlation gets stronger.

Question 2¶

Overview¶

In this section, we will aim to answer the question, To what extent does government expenditure on education reduce income inequality, as measured by the Gini coefficient?

Firstly, an overview of the indicators:

  1. Gini Index (SI.POV.GINI) Measures income inequality on a scale of 0 to 100. A value of 0 indicates perfect equality, while 100 represents perfect inequality.

  2. Income Share Held by Highest 10% (SI.DST.10TH.10) Percentage of total income earned by the top 10% of the population. A higher percentage indicates greater income concentration and inequality.

  3. Income Share Held by Highest 20% (SI.DST.05TH.20) Percentage of total income held by the top 20%. Similar to the previous indicator, it reflects overall income distribution and inequality.

  4. Income Share Held by Lowest 10% (SI.DST.FRST.10) Percentage of total income received by the lowest 10%. A lower value indicates poorer income distribution and highlights poverty issues.

  5. Income Share Held by Lowest 20% (SI.DST.FRST.20) Measures income share of the bottom 20%. Like the lowest 10%, it signals income inequality and the economic conditions of disadvantaged groups.

In Exploratory Analysis, we will be examining these indicators in turn, and what they reveal about the overarching question. We will be examining the data via plots and graphs.

In Quantitative Analysis, we will be tacking the question head-on, trying to unearth the underlying reasons behind our suspicions.

Cleaning¶

We will be using the following datasets for this question

  1. https://data.worldbank.org/indicator/SI.POV.GINI

    • (Global), Gini Index
  2. https://data.worldbank.org/indicator/SI.DST.10TH.10

    • (Global), Income share held by highest 10%
  3. https://data.worldbank.org/indicator/SI.DST.05TH.20

    • (Global), Income share held by highest 20%
  4. https://data.worldbank.org/indicator/SI.DST.FRST.10

    • (Global), Income share held by lowest 10%
  5. https://data.worldbank.org/indicator/SI.DST.FRST.20

    • (Global), Income share held by lowest 20%
  6. https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

    • (Global), GDP (Current US$)
  7. https://data.worldbank.org/indicator/SP.POP.TOTL?name_desc=false

    • (Global), Total Population
  8. https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?_gl=1

    • (Global), Government expenditure on education, total (% of GDP)
In [43]:
gini = clean_df(df16.copy())
top_10 = clean_df(df17.copy())
top_20 = clean_df(df18.copy())
bottom_20 = clean_df(df19.copy())
bottom_10 = clean_df(df20.copy())
edu_expense = global_edu_expense_percent_gdp * global_gdp / 100
edu_expense_per_capita = edu_expense / global_pop
Exploratory Analysis¶

Firstly, lets look at some mean income inequality indicators over that past decade

In [44]:
mean_gini = gini.iloc[-10:, :].mean()
mean_top_10 = top_10.iloc[-10:, :].mean()
mean_top_20 = top_20.iloc[-10:, :].mean()
mean_bottom_20 = bottom_20.iloc[-10:, :].mean()
mean_bottom_10 = bottom_10.iloc[-10:, :].mean()
merged_mean = pd.DataFrame(data={"Gini Index": mean_gini, "Top 10%": mean_top_10, "Top 20%": mean_top_20, "Bottom 10%": mean_bottom_10, "Bottom 20%": mean_bottom_20})
fig = sns.violinplot(merged_mean)
plt.ylabel('Number of Countries')
plt.title("Boxplots of Income Inequality Indicators")
plt.show()
No description has been provided for this image

The graph above shows how there is a clear disparity between the Income groups. What we will investigate in the following sections is how Education Expenditure correlates with this disparity

In [45]:
map_data = gini.ffill().reset_index().melt(id_vars='index').rename(columns={"variable": "iso_code", "index": "year"})
map_data['country_name'] = map_data['iso_code'].map(country_codes_dict)
map_data = map_data[map_data.year.dt.year == 2023]
fig = px.choropleth(map_data,               
              locations="iso_code",               
              color="value",
              hover_name="country_name",  
              color_continuous_scale=px.colors.diverging.RdBu_r,
              height=600,
              range_color=[20, 60]           
)
fig.update_layout(
    title_text='Gini Index in 2023',  # Set your title here
    title_x=0.5  # Center the title (optional)
)
fig.show()

map_data = top_10.ffill().reset_index().melt(id_vars='index').rename(columns={"variable": "iso_code", "index": "year"})
map_data['country_name'] = map_data['iso_code'].map(country_codes_dict)
map_data = map_data[map_data.year.dt.year == 2023]
fig = px.choropleth(map_data,               
              locations="iso_code",               
              color="value",
              hover_name="country_name",  
              color_continuous_scale=px.colors.diverging.RdBu_r,
              height=600,
              range_color=[10, 50]           
)
fig.update_layout(
    title_text='Top 10% Share in 2023',  # Set your title here
    title_x=0.5  # Center the title (optional)
)
fig.show()

The graph above depicts how developing countries tend to have higher Gini coefficients and Top 10% Share than developed countries. This means income inequality is often more pronounced in developing countries.

However, it’s important to note that there are exceptions. Some developed countries, like the United States (0.41), have relatively high Gini coefficients compared to other developed nations.

Now, lets turn our attention to the predictor variable, Educational Expenditure per Capita. As in Question 1, we will use Educational Expenditure per Capita instead of Educational Expenditure to fairly assess the impact of investing in education on Income Inequality

In [46]:
fig = plt.figure()
 
ax0 = fig.add_subplot(1, 2, 1)
ax1 = fig.add_subplot(1, 2, 2)

# Subplot 1: Box plot
edu_expense_per_capita.iloc[-20:, :].mean().plot(kind='box', color='blue', vert=False, figsize=(20, 8), ax=ax0)  # add to subplot 1
ax0.set_title('Box Plot of Global Education Expenditure per Capita')
ax0.set_xlabel('Education Expenditure per Capita')

# Subplot 2: Histogram plot
edu_expense_per_capita.iloc[-20:, :].mean().plot(kind='hist', figsize=(20, 8), ax=ax1)  # add to subplot 2
ax1.set_title('Distribution of Global Education Expenditure per Capita')
ax1.set_ylabel('Frequency')
ax1.set_xlabel('Education Expenditure per Capita')

fig.suptitle('Global Education Expenditure (per Capita) Analysis', fontsize=16)

plt.show()
No description has been provided for this image

Clearly, Educational Expenditure per Capita is heavily skewed to the right. We will now proceed to investigate this further.

The graph below depicts a treemap of Mean Education Expenditure per Capita over the Last Decade. This will allow us to glimpse the countries with high/low Mean Education Expenditure per Capita over the Last Decade.

In [47]:
treemap_data = edu_expense_per_capita.iloc[-10:, :].mean().reset_index().rename(columns={"index": "iso_code", 0:'value'})
treemap_data['country_name'] = treemap_data['iso_code'].map(country_codes_dict)
fig = px.treemap(
    treemap_data,
    path=['iso_code'],  # Path defines the hierarchy (here we only use Country)
    values='value',      # The size of the rectangles will be based on GDP
    color='value', # The color of the rectangles will be based on Population
    hover_data=['country_name', 'value'],  # Data to show on hover
    color_continuous_scale='Plasma', # Diverging color scale
    title="Mean Education Expenditure per Capita over the Last Decade"
)
fig.show()

From the graph above, we can begin to suspect that countries that have low population tend to have high Mean Education Expenditure per Capita and countries that are underdeveloped tend to have very low Mean Education Expenditure per Capita.

To confirm this, let us look at a Chloropleth graph displaying Education Expenditure per Capita in 2023

In [48]:
map_data = edu_expense_per_capita.ffill().reset_index().melt(id_vars='index').rename(columns={"variable": "iso_code", "index": "year"})
map_data = map_data[map_data.year.dt.year == 2023]
map_data['country_name'] = map_data['iso_code'].map(country_codes_dict)
fig = px.choropleth(map_data,               
              locations="iso_code",               
              color="value",
              hover_name="country_name", 
              color_continuous_scale=px.colors.diverging.RdBu,
              height=600,
              range_color=[0, 2000]           
)
fig.update_layout(
    title_text='Education Expenditure per Capita in 2023',  # Set your title here
    title_x=0.5  # Center the title (optional)
)
fig.show()

From the graph above, we can see that North America and most of Europe tend to have high Education Expenditure per Capita. On the other hand, most of Africa has very low Education Expenditure per Capita. Therefore, the graph above confirms that the development of a country has a strong correlation with its Education Expenditure per Capita. However, one thing to note is that the difference in Education Expenditure per Capita between highly developed and underdeveloped countries is quite large (80-90 times).

Finally, the graph below depicts a plot of Gini Index over time, for various countries. Additionally, the countries have been coloured such that the ones with low educational expenditure are colored red while the ones with high educational expenditure are colored blue

In [49]:
plt.figure(figsize=(9, 7))

inv_color_dict = edu_expense_per_capita.iloc[-10:, :].mean().dropna().sort_values(ascending=False).reset_index().drop(0, axis=1).to_dict()['index']
keys = np.array(list(inv_color_dict.keys()))
values = np.array(list(inv_color_dict.values()))
color_dict = dict(zip(values, keys))

# Get a colormap
cmap = plt.get_cmap('coolwarm')
norm = plt.Normalize(0, len(color_dict))

# Plot each country's Gini index as a scatter plot, only if the country exists in color_dict
for country in gini.columns:
    if country in color_dict:  # Check if the country is in the color_dict
        color_value = norm(color_dict[country])  # Normalize the country index
        sns.scatterplot(x=gini.index.year, y=gini[country], color=cmap(color_value), label=country)

# Add a legend
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', ncol=6)
plt.ylabel("Gini Index")
plt.xlabel("Time")
plt.show()
No description has been provided for this image

The graph depicts a transition from red or grey to blue as we move downwards. This hints at a positive correlation between Educational Expenditure and Gini Index, something we will investigate in more detail in the Quantitative Analysis section

Quantitative Analysis¶
In [50]:
gini_melted = gini.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'gini_index'})
top_10_melted = top_10.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'top_10'})
top_20_melted = top_20.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'top_20'})
bottom_20_melted = bottom_20.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'bottom_20'})
bottom_10_melted = bottom_10.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'bottom_10'})
edu_expense_melted = edu_expense.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'edu_expense'})
edu_expense_per_capita_melted = edu_expense_per_capita.reset_index().melt(id_vars='index').rename(columns={"index": "year", "variable": 'iso_code', 'value': 'edu_expense_per_capita'})

q2_df = pd.merge(gini_melted, top_10_melted, on=['iso_code', 'year'], how='outer')
q2_df = pd.merge(q2_df, top_20_melted, on=['iso_code', 'year'], how='outer')
q2_df = pd.merge(q2_df, bottom_20_melted, on=['iso_code', 'year'], how='outer')
q2_df = pd.merge(q2_df, bottom_10_melted, on=['iso_code', 'year'], how='outer')
q2_df = pd.merge(q2_df, edu_expense_melted, on=['iso_code', 'year'], how='outer')
q2_df = pd.merge(q2_df, edu_expense_per_capita_melted, on=['iso_code', 'year'], how='outer')

q2_df_raw = q2_df
q2_df = q2_df.dropna().reset_index(drop=True)
q2_df
Out[50]:
year iso_code gini_index top_10 top_20 bottom_20 bottom_10 edu_expense edu_expense_per_capita
0 2000-01-01 AGO 51.9 40.2 56.1 3.2 1.0 2.380569e+08 14.520925
1 2018-01-01 AGO 51.3 39.6 55.6 3.8 1.3 1.624767e+09 51.953409
2 1996-01-01 ALB 27.0 20.7 36.0 9.2 3.9 9.866124e+07 31.142744
3 2002-01-01 ALB 31.7 25.3 40.3 8.4 3.5 1.355641e+08 44.432523
4 2005-01-01 ALB 30.6 24.3 39.2 8.4 3.5 2.642329e+08 87.741682
... ... ... ... ... ... ... ... ... ...
1613 2010-01-01 ZMB 52.0 41.0 57.5 4.1 1.6 6.922715e+08 50.193386
1614 2015-01-01 ZMB 55.8 43.4 60.2 3.1 1.2 9.827264e+08 60.482059
1615 2022-01-01 ZMB 51.5 39.1 56.4 3.9 1.5 1.044938e+09 52.200782
1616 2011-01-01 ZWE 43.2 33.8 49.7 5.8 2.5 7.160955e+08 54.975230
1617 2017-01-01 ZWE 44.3 34.8 51.1 6.0 2.5 1.023226e+09 69.366084

1618 rows × 9 columns

The figure below depicts a heatmap of the Pearson Correlation Coefficients between variables in our analysis

In [51]:
plt.figure(figsize=(8, 6))
sns.heatmap(q2_df.iloc[:, 2:].corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Heatmap of Correlation Between Numerical Variables')
plt.show()
No description has been provided for this image

Notice how edu_expense_per_capita has a somewhat strong negative correlation with gini_index (along with top_10 and top_20). Although this is definitely enough to claim that Educational Expenditure (per Capita) plays a role in reducing income inequality, let us delve into this deeper.

The graph below depicts a scatterplot of GNI Index against Educational Expenditure (per Capita). The graph also shows the corresponding linear regression line in red.

In [52]:
fig = sns.regplot(data=q2_df, x='edu_expense_per_capita', y='gini_index', marker='.', line_kws={"color": "#FF6666"})
plt.ylabel("Gini Index")
plt.xlabel("Edu Expenditure per Capita")
plt.title("Gini Index vs Educational Expenditure")
plt.show()
No description has been provided for this image

Note the quartile values of Gini Index, listed below.

In [53]:
q2_df.describe()[['gini_index']]
Out[53]:
gini_index
count 1618.000000
mean 36.674845
min 20.700000
25% 30.400000
50% 34.800000
75% 41.575000
max 64.800000
std 8.357212

There are now 2 things to note from the scatterplot. Firstly, note how all the countries with Gini Index greater that the 3rd Quartile have very low Educational Expenditure per Capita. Secondly, we notice that these values deviate significantly from our regression line. We will now proceed to investigate.

The code below analyses the countries with Gini Index greater that the 3rd Quartile, and Educational Expenditure per Capita less that $800 USD.

In [54]:
outliers = q2_df[(q2_df.gini_index > q2_df.gini_index.quantile(0.75)) & (q2_df.edu_expense_per_capita < 800)].groupby(['iso_code'])[['gini_index', 'edu_expense_per_capita']].mean()
outliers['GDP_per_capita'] = outliers.index.map(global_gdp_per_capita.mean(axis=0).to_dict())
outliers['GDP_per_capita'].plot(kind='hist')
plt.title('Distribution of GDP per Capita among Outliers')
plt.xlabel('GDP per Capita (USD)')
plt.ylabel('Number of Countries')
plt.show()
No description has been provided for this image

We see that most of such outliers are underdeveloped countries, with very low GDP per Capita. Such underdeveloped countries are subject to high volatility. Additionally, many other factors are at play in determining their Gini Index, such as government corruption, resources. Consequently, such countries may deviate quite significantly from our regression line.

Findings¶
  1. A country's Educational Expenditure per Capita has a strong negative correlation on its Gini Index
  2. All countries with Gini Index greater than the 75th percentile have very low Educational Expenditure per Capita

Question 3¶

Overview¶

In this section, we aim to address the question, "What predictive trends can be identified regarding the future impact of current education spending in Singapore on the economic well-being of its population?". To achieve this, we will forecast two key variables: GDP Growth (Annual %) and Gini Index.

The (Multiple Linear Regression Model) MLM for GDP Growth (Annual %) will incorporate Population, Educational Expenditure as a percent of GDP, Educational Expenditure per Capita, GDP per Capita as predictors. By providing inputs for Educational Expenditure as a percent of GDP, we will explore how it influences Singapore’s future GDP. Furthermore, based on our findings from Question 1, we will exclude countries with very low Population or GDP per Capita from the training data to enhance the robustness of the model.

The (Linear Regression Model) LM for Gini Index will focus on predicting income inequality, using Educational Expenditure per Capita as a key predictor, in line with the insights derived from our analysis in Question 2.

Cleaning¶

Firstly, lets combine the datasets from Questions 1 and 2 so that we have all the variable necessary to create our model in a single DataFrame. Note that we are combining the raw datasets, as we cannot afford to lose all the rows with missing values while training our model.

In [55]:
q3_df_raw = q1_df_raw.rename(columns={"country_code": "iso_code"}).drop(['edu_expense', 'edu_expense_per_capita'], axis=1).merge(q2_df_raw, on=['iso_code', 'year'], how='outer')
q3_df_raw
Out[55]:
year iso_code population GDP GDP_per_capita edu_expense_per_GDP country_name gini_index top_10 top_20 bottom_20 bottom_10 edu_expense edu_expense_per_capita
0 1960-01-01 ABW 54608.0 NaN NaN NaN Aruba NaN NaN NaN NaN NaN NaN NaN
1 1961-01-01 ABW 55811.0 NaN NaN NaN Aruba NaN NaN NaN NaN NaN NaN NaN
2 1962-01-01 ABW 56682.0 NaN NaN NaN Aruba NaN NaN NaN NaN NaN NaN NaN
3 1963-01-01 ABW 57475.0 NaN NaN NaN Aruba NaN NaN NaN NaN NaN NaN NaN
4 1964-01-01 ABW 58178.0 NaN NaN NaN Aruba NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13755 2019-01-01 ZWE 15354608.0 2.183223e+10 1421.868596 NaN Zimbabwe 50.3 NaN NaN NaN NaN NaN NaN
13756 2020-01-01 ZWE 15669666.0 2.150970e+10 1372.696674 NaN Zimbabwe NaN NaN NaN NaN NaN NaN NaN
13757 2021-01-01 ZWE 15993524.0 2.837124e+10 1773.920411 NaN Zimbabwe NaN NaN NaN NaN NaN NaN NaN
13758 2022-01-01 ZWE 16320537.0 2.736663e+10 1676.821489 NaN Zimbabwe NaN NaN NaN NaN NaN NaN NaN
13759 2023-01-01 ZWE 16665409.0 2.653827e+10 1592.416574 NaN Zimbabwe NaN NaN NaN NaN NaN NaN NaN

13760 rows × 14 columns

In [56]:
print("\n\033[1mOriginal dataset\033[0m\n")
q3_df_raw.info()
print("\n\033[1mDataset after dropping rows with null values\033[0m\n")
q3_df_raw.dropna().info()
Original dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13760 entries, 0 to 13759
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   year                    13760 non-null  datetime64[ns]
 1   iso_code                13760 non-null  object        
 2   population              13730 non-null  float64       
 3   GDP                     11113 non-null  float64       
 4   GDP_per_capita          11113 non-null  float64       
 5   edu_expense_per_GDP     4927 non-null   float64       
 6   country_name            13760 non-null  object        
 7   gini_index              2099 non-null   float64       
 8   top_10                  2098 non-null   float64       
 9   top_20                  2098 non-null   float64       
 10  bottom_20               2098 non-null   float64       
 11  bottom_10               2098 non-null   float64       
 12  edu_expense             4863 non-null   float64       
 13  edu_expense_per_capita  4863 non-null   float64       
dtypes: datetime64[ns](1), float64(11), object(2)
memory usage: 1.5+ MB

Dataset after dropping rows with null values

<class 'pandas.core.frame.DataFrame'>
Index: 1618 entries, 168 to 13753
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   year                    1618 non-null   datetime64[ns]
 1   iso_code                1618 non-null   object        
 2   population              1618 non-null   float64       
 3   GDP                     1618 non-null   float64       
 4   GDP_per_capita          1618 non-null   float64       
 5   edu_expense_per_GDP     1618 non-null   float64       
 6   country_name            1618 non-null   object        
 7   gini_index              1618 non-null   float64       
 8   top_10                  1618 non-null   float64       
 9   top_20                  1618 non-null   float64       
 10  bottom_20               1618 non-null   float64       
 11  bottom_10               1618 non-null   float64       
 12  edu_expense             1618 non-null   float64       
 13  edu_expense_per_capita  1618 non-null   float64       
dtypes: datetime64[ns](1), float64(11), object(2)
memory usage: 189.6+ KB

Clearly, there are a lot of missing values. Additionally, we cannot afford to drop all the missing rows as that results in a huge loss of available data. However, since these are socio-economic indicators, they are very unlikely to vary much within a decade. Therefore, we will employ Forward Fill to fill in these missing values, till a limit of 10 entries. Afterwards, we will melt and merge these new datasets to get the dataset we need to train our model.

In [57]:
gini_filled = gini.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'gini_index'})
global_pop_filled = global_pop.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'population'})
global_gdp_filled = global_gdp.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'gdp'})
global_gdp_per_capita_filled = global_gdp_per_capita.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'gdp_per_capita'})
global_edu_expense_percent_gdp_filled = global_edu_expense_percent_gdp.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'edu_expense_percent_gdp'})
top_10_filled = top_10.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'top_10_income_share'})
top_20_filled = top_20.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'top_20_income_share'})
bottom_20_filled = bottom_20.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'bottom_20_income_share'})
bottom_10_filled = bottom_10.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'bottom_10_income_share'})
edu_expense_filled = edu_expense.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'edu_expense'})
edu_expense_per_capita_filled = edu_expense_per_capita.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'edu_expense_per_capita'})
global_gdp_growth_percent_filled = global_gdp_growth_percent.ffill(limit=10).reset_index().rename(columns={"index": "year"}).melt(id_vars='year').rename(columns={'variable': 'iso_code', 'value': 'gdp_growth'})

The code below prepares the data for the GDP per Capita MLM. We drop the values with null entries after our forward-fill procedure, along with filtering out the countries with low Population or GDP per Capita

In [58]:
gdp_per_capita_mlm_data = global_pop_filled.merge(global_edu_expense_percent_gdp_filled, on=['iso_code', 'year'], how='outer') \
                                           .merge(global_gdp_per_capita_filled, on=['iso_code', 'year'], how='outer') \
                                           .merge(global_gdp_growth_percent_filled, on=['iso_code', 'year'], how='outer') \
                                           .merge(edu_expense_per_capita_filled, on=['iso_code', 'year'], how='outer')

gdp_per_capita_mlm_data.dropna(inplace=True)
gdp_per_capita_mlm_data = gdp_per_capita_mlm_data[(gdp_per_capita_mlm_data.population >= 5e6) & (gdp_per_capita_mlm_data.gdp_per_capita >= 5e3)].reset_index(drop=True)
gdp_per_capita_mlm_data
Out[58]:
year iso_code population edu_expense_percent_gdp gdp_per_capita gdp_growth edu_expense_per_capita
0 2012-01-01 AGO 25188292.0 3.08200 5083.826873 8.542107 156.683545
1 2013-01-01 AGO 26147002.0 3.87000 5061.349253 4.954613 195.874210
2 2014-01-01 AGO 27128337.0 2.93000 5011.984427 4.822559 146.851147
3 2019-01-01 ARE 9211657.0 3.86021 45376.170839 1.108348 1751.615458
4 2020-01-01 ARE 9287289.0 3.98418 37629.174168 -4.957052 1499.214021
... ... ... ... ... ... ... ...
1527 2019-01-01 ZAF 58087055.0 5.92770 6702.526617 0.259936 397.305673
1528 2020-01-01 ZAF 58801927.0 6.18336 5753.066494 -5.963358 355.732818
1529 2021-01-01 ZAF 59392255.0 6.56206 7073.612754 4.703062 464.174705
1530 2022-01-01 ZAF 59893885.0 6.18348 6766.481254 1.910406 418.404001
1531 2023-01-01 ZAF 60414495.0 6.60400 6253.161613 0.601662 412.958799

1532 rows × 7 columns

The code below prepares the data for the Gini Index LM. We drop the values with null entries after our forward-fill procedure.

In [59]:
gini_index_mlm_data = edu_expense_per_capita_filled.merge(gini_filled, on=['iso_code', 'year'], how='outer')
gini_index_mlm_data = gini_index_mlm_data.dropna().reset_index(drop=True)
gini_index_mlm_data
Out[59]:
year iso_code edu_expense_per_capita gini_index
0 2000-01-01 AGO 14.520925 51.9
1 2001-01-01 AGO 14.520925 51.9
2 2002-01-01 AGO 14.520925 51.9
3 2003-01-01 AGO 14.520925 51.9
4 2004-01-01 AGO 14.520925 51.9
... ... ... ... ...
4203 2019-01-01 ZWE 46.529246 50.3
4204 2020-01-01 ZWE 46.529246 50.3
4205 2021-01-01 ZWE 46.529246 50.3
4206 2022-01-01 ZWE 46.529246 50.3
4207 2023-01-01 ZWE 46.529246 50.3

4208 rows × 4 columns

Model Creation and Evaluation¶

The cell below creates a Multiple Linear Regression Model (MLM) for predicting GDP Growth based on Population, Educational Expenditure (percent GDP), Educational Expenditure (per Capita) and GDP per Capita

In [60]:
gdp_per_capita_mlm = LinearRegression()
X_train_gdp, X_test_gdp, y_train_gdp, y_test_gdp = train_test_split(gdp_per_capita_mlm_data[['population', 'edu_expense_percent_gdp', 'gdp_per_capita', 'edu_expense_per_capita']], gdp_per_capita_mlm_data['gdp_growth'], test_size=0.2, random_state=0)
gdp_per_capita_mlm.fit(X_train_gdp, y_train_gdp)
Out[60]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [61]:
coefficients = gdp_per_capita_mlm.coef_
intercept = gdp_per_capita_mlm.intercept_

# Print the equation for GDP growth
equation = "gdp_growth = {:.2e} * population + {:.2e} * edu_expense_percent_gdp + {:.2e} * gdp_per_capita + {:.2e} * edu_expense_per_capita + {:.4f}".format(
    coefficients[0], coefficients[1], coefficients[2], coefficients[3], intercept)

print("The equation for GDP growth is:")
print(equation)
The equation for GDP growth is:
gdp_growth = 1.94e-09 * population + -4.21e-01 * edu_expense_percent_gdp + -2.45e-05 * gdp_per_capita + 2.03e-04 * edu_expense_per_capita + 4.8444

Below is a scatterplot displaying predicted values of GDP Growth against actual values of GDP Growth

In [62]:
y_hat_gdp = gdp_per_capita_mlm.predict(X_test_gdp)
plt.scatter(y_test_gdp, y_hat_gdp, marker='.')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Predicted vs Actual')
plt.show()
No description has been provided for this image

From the scatterplot, we can see that there are quite a few outliers, with GDP Growth values below -5 or above 10. Such values are quite definitely outliers. Although GDP Growth is definitely subject to much volatility, as a result of wars, natural disasters, etc., we definitely cannot take these values into account in our model.

The cell below display the Mean Squared Error (MSE) of this model. Let us now proceed to remove the outliers.

In [63]:
print("MSE: ", mean_squared_error(y_test_gdp, y_hat_gdp))
MSE:  11.769396860956444
In [64]:
gdp_per_capita_mlm_data = gdp_per_capita_mlm_data[abs(gdp_per_capita_mlm_data.gdp_growth) <= 5]
gdp_per_capita_mlm = LinearRegression()
X_train_gdp, X_test_gdp, y_train_gdp, y_test_gdp = train_test_split(gdp_per_capita_mlm_data[['population', 'edu_expense_percent_gdp', 'gdp_per_capita', 'edu_expense_per_capita']], gdp_per_capita_mlm_data['gdp_growth'], test_size=0.2, random_state=0)
gdp_per_capita_mlm.fit(X_train_gdp, y_train_gdp)
y_hat_gdp = gdp_per_capita_mlm.predict(X_test_gdp)
plt.scatter(y_test_gdp, y_hat_gdp, marker='.')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
No description has been provided for this image

From the above graph, we can see that our model has improved considerably. Additionally, the MSE (displayed below) has reduced. Therefore, let us proceed with the latter model.

In [65]:
print("MSE: ", mean_squared_error(y_test_gdp, y_hat_gdp))
MSE:  5.4142837248373885

Let us take a look at the Residual Plots for our latest model.

In [66]:
residuals = y_test_gdp - y_hat_gdp
 
plt.figure(figsize=(8,6))
sns.scatterplot(x=y_hat_gdp, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')  # Add a horizontal line at y=0
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
No description has been provided for this image
In [67]:
fig, axs = plt.subplots(2, 2, figsize=(15, 10))

axs = axs.flatten()

y_pred_gdp = gdp_per_capita_mlm.predict(X_test_gdp)
residuals_gdp = y_test_gdp - y_pred_gdp

for i, ax in enumerate(axs):
    if i < X_test_gdp.shape[1]:
        ax.scatter(X_test_gdp.iloc[:, i], residuals_gdp, color='blue', alpha=0.6)
        ax.axhline(y=0, color='r', linestyle='--')
        ax.set_xlabel(X_test_gdp.columns[i])
        ax.set_ylabel('Residuals')
        ax.set_title(f'Residuals vs. {X_test_gdp.columns[i]}')

plt.tight_layout()
plt.show()
No description has been provided for this image

All of the residual plots above are fairly evenly distributed. However, take note of the y-axis. the negative residuals are more sparse but have higher deviation, while the positive residuals are more common, but have lower deviation.

The cell below initiates a Linear Regression Model (LM) to predict the Gini Index of a country based on its Educational Expenditure per Capita.

In [68]:
gini_lm = LinearRegression()
X_train_gini, X_test_gini, y_train_gini, y_test_gini = train_test_split(gini_index_mlm_data[['edu_expense_per_capita']], gini_index_mlm_data['gini_index'], test_size=0.2, random_state=0)
gini_lm.fit(X_train_gini, y_train_gini)
Out[68]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [69]:
gini_coefficient = gini_lm.coef_[0]
gini_intercept = gini_lm.intercept_

gini_equation = "gini_index = {:.2e} * edu_expense_per_capita + {:.2f}".format(gini_coefficient, gini_intercept)

print("The equation for predicting Gini index is:")
print(gini_equation)
The equation for predicting Gini index is:
gini_index = -3.42e-03 * edu_expense_per_capita + 40.67

The equation above depicts how (according to the LM) a higher Educational Expenditure per Capita gradually decreases the Gini Index. The graph below shows a scatterplot of predicted Gini Index values vs actual ones

In [70]:
y_hat_gini = gini_lm.predict(X_test_gini)
plt.scatter(y_test_gini, y_hat_gini, marker='.')
plt.plot([20, 40], [20, 40], color="red")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
No description has been provided for this image
In [71]:
print("Mean Squared Error: ", mean_squared_error(y_hat_gini, y_test_gini))
print("R-squared value: ", gini_lm.score(X_train_gini, y_train_gini))
Mean Squared Error:  69.04775961971771
R-squared value:  0.14890953856695732

The scatterplot above depicts how most of the deviation from our model occurs when the actual Gini Index is quite high (above the 3rd Quartile). However, as we had discussed in Question 2, only countries that are underdeveloped have Gini Index above the 3rd Quartile. Therefore, let us remove these countries from the dataset and recreate our model.

In [72]:
gini_index_mlm_data = gini_index_mlm_data[gini_index_mlm_data['gini_index'] <= gini_index_mlm_data['gini_index'].quantile(0.75)]
gini_lm = LinearRegression()
X_train_gini, X_test_gini, y_train_gini, y_test_gini = train_test_split(gini_index_mlm_data[['edu_expense_per_capita']], gini_index_mlm_data['gini_index'], test_size=0.2, random_state=0)
gini_lm.fit(X_train_gini, y_train_gini)

y_hat_gini = gini_lm.predict(X_test_gini)
plt.scatter(y_test_gini, y_hat_gini, marker='.')
plt.plot([22, 36], [22, 36], color='red')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
No description has been provided for this image
In [73]:
print("Mean Squared Error: ", mean_squared_error(y_hat_gini, y_test_gini))
print("R-squared value: ", gini_lm.score(X_train_gini, y_train_gini))
Mean Squared Error:  25.653538915221688
R-squared value:  0.16104166357384508

Clearly, this improves our model considerably. Therefore, let us proceed with this improved model.

The below displays a residual plot

In [74]:
sns.residplot(x=gini_index_mlm_data['edu_expense_per_capita'], y=gini_index_mlm_data['gini_index'])
plt.xlabel('Educational Expenditure per Capita')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
No description has been provided for this image

Note that when Educational Expenditure per Capita increases, the residuals tend to be positive. Therefore, we can infer that developed countries will probably have lower Gini Index than the one that we predict.

Model Application¶

In this section, we will apply the models previously developed and try to forecast Singapore's Economic Welfare based on different Educational Expenditure policies. We will do this by forecasting 2 key variables, GDP and Gini Index, for the next 50 years.

We will first predict GDP using our MLM, and then use the predicted Educational Expenditure per Capita to predict the Gini Index using our LM.

In [75]:
sgp_data = pd.merge(global_gdp_per_capita.loc[:, 'SGP'].rename('gdp_per_capita'), global_pop.loc[:, 'SGP'].rename('population'), left_index=True, right_index=True)
sg_edu_expense_per_capita = sg_edu_expense['Expenditure On Education'] / global_pop.loc[:, 'SGP']
sgp_data = pd.merge(sgp_data, sg_edu_expense_per_capita.rename("edu_expense_per_capita"), left_index=True, right_index=True)
sg_edu_expense_per_gdp = sg_edu_expense['Expenditure On Education'] / sg_gdp['GDP'] * 100
sgp_data = pd.merge(sgp_data, sg_edu_expense_per_gdp.rename("edu_expense_percent_gdp"), left_index=True, right_index=True)

sgp_data = sgp_data[['population', 'edu_expense_percent_gdp', 'gdp_per_capita', 'edu_expense_per_capita']]
sgp_data
Out[75]:
population edu_expense_percent_gdp gdp_per_capita edu_expense_per_capita
1960-01-01 1646400.0 NaN 428.056183 29.090343
1961-01-01 1702400.0 NaN 449.148137 34.160280
1962-01-01 1750200.0 NaN 472.082740 42.244875
1963-01-01 1795000.0 NaN 511.202235 46.881694
1964-01-01 1841600.0 NaN 485.530686 50.922361
... ... ... ... ...
2019-01-01 5703569.0 2.463315 66081.719924 1739.990395
2020-01-01 5685807.0 2.523389 61466.803676 1681.851906
2021-01-01 5453566.0 2.330660 79601.412962 1846.448405
2022-01-01 5637022.0 1.942417 88428.702423 1807.197737
2023-01-01 5917648.0 NaN 84734.255921 NaN

64 rows × 4 columns

The graph below depicts a Small Multiples plot of the data above.

In [76]:
nrow=2
ncol=2
 
fig, axes = plt.subplots(nrow, ncol,figsize=(16,12))
 
titles = ['Population', 'Edu Expense (% GDP)', 'GDP per Capita', 'Edu Expense per Capita']

# plot counter
count=0
for r in range(nrow):
    for c in range(ncol):
        sgp_data[sgp_data.columns[count]].plot(kind='line', ax=axes[r,c])
        axes[r,c].set_title(titles[count]) #give every subplot a title
        count+=1
        
plt.suptitle("Trends of Various Indicators in SGP over Iime") #You can add a title to the entire figure with the suptitle() function
plt.tight_layout()
plt.show()
No description has been provided for this image

All of the graphs above show a very clear (and similar) upwards trend, with the exception of Educational Expenditure (as a percent of GDP). Note that Singapore's Educational Expenditure (as a percent of GDP) is relatively low compared to other countries, based on our findings in Question 1. Therefore, we will adjust this value and forecast Singapore's future GDP using our MLM.

Below is a graph of Singapore's population growth year on year.

In [77]:
fig = (sgp_data['population'].pct_change() * 100).plot(kind='line')
mean_sgp_pop_growth = (sgp_data['population'].pct_change() * 100).mean()
last_15_sgp_pop_growth = (sgp_data['population'].pct_change()[-15:] * 100).mean()
plt.axhline(y=mean_sgp_pop_growth, color='r', linestyle='--')
plt.axhline(y=last_15_sgp_pop_growth, color='g', linestyle='--')
plt.text(54, mean_sgp_pop_growth-0.1, f'Mean Overall: {mean_sgp_pop_growth:.2f}%', color='r', fontsize=12)
plt.text(54, last_15_sgp_pop_growth-0.2, f'Last 15 Years: {last_15_sgp_pop_growth:.2f}%', color='g', fontsize=12)
plt.title('SGP Population Growth (Annual %)')
plt.show()
No description has been provided for this image

For the purposes of our model, we will assume that Singapore's Population Growth remains relatively unaffected by GDP. However, note that Therefore, we will model the population as having a steady growth of ~2% (as shown above) year on year.

Here is a detailed explanation of how our model will operate, along with the reasoning behind each step. First, we will set a value for Educational Expenditure as a percentage of GDP, which reflects the proportion of the national budget allocated to education.

Next, we will model the projected population growth for the next 50 years.

Our Multiple Linear Regression Model (MLM) will then come into play to predict GDP growth for each year, based on several critical variables:

  1. Previous year's Population
  2. Educational Expenditure as a percentage of GDP
  3. Educational Expenditure per Capita
  4. GDP per Capita

Once the model calculates the projected GDP Growth for a given year, we will use this growth rate to estimate the next year's GDP per Capita and Educational Expenditure per Capita.

This process will be repeated iteratively for each year over a 50-year period, allowing us to simulate how changes in education spending and population growth could influence Singapore's economic well-being over the long term.

The below cell describes the function for predicting Singapore's GDP trajectory based on an input variable of Educational Expenditure (percent GDP). A sample trajectory table is provided, with an input Educational Expenditure (percent GDP) of 4%

In [78]:
def predict_sgp(preset_edu_expense_per_gdp):
    sgp_data_predicted = sgp_data.iloc[[-1]].copy()
    sgp_data_predicted.loc[:, 'edu_expense_percent_gdp'] = preset_edu_expense_per_gdp
    sgp_data_predicted.loc[:, 'edu_expense_per_capita'] = preset_edu_expense_per_gdp * sgp_data_predicted.gdp_per_capita / 100

    for i in range(50):
        variables = sgp_data_predicted.iloc[[-1]].copy()
        new_gdp_growth = gdp_per_capita_mlm.predict(variables)
        new_gdp = variables.iloc[0, 2] * variables.iloc[0, 0] * (1+new_gdp_growth/100)
        new_pop = variables.iloc[0, 0] * (1+last_15_sgp_pop_growth/100)
        new_gdp_per_capita = new_gdp / new_pop
        new_edu_expense_per_capita = preset_edu_expense_per_gdp * new_gdp / (100 * new_pop)
        sgp_data_predicted.loc[sgp_data_predicted.index[-1] + pd.DateOffset(years=1)] = [new_pop, preset_edu_expense_per_gdp, new_gdp_per_capita[0], new_edu_expense_per_capita[0]]
    return sgp_data_predicted

predict_sgp(4).tail(5)
Out[78]:
population edu_expense_percent_gdp gdp_per_capita edu_expense_per_capita
2069-01-01 1.106090e+07 4.0 89905.235088 3596.209404
2070-01-01 1.121233e+07 4.0 90004.570101 3600.182804
2071-01-01 1.136583e+07 4.0 90103.367836 3604.134713
2072-01-01 1.152143e+07 4.0 90201.629706 3608.065188
2073-01-01 1.167916e+07 4.0 90299.357141 3611.974286

The graph below depicts GDP per Capita (in 2073) vs. Edu Expenditure (percent GDP)

In [79]:
import matplotlib.pyplot as plt

# Define the preset values
preset_list = [0, 2, 4, 6, 8, 10, 12, 15, 20]
gdp_changes = []

# Collect the last value of gdp_per_capita for each preset
for preset in preset_list:
    gdp_changes.append(predict_sgp(preset).gdp_per_capita.iloc[-1])

# Create the line chart
plt.figure(figsize=(10, 6))
plt.plot(preset_list, gdp_changes, marker='o', linestyle='-', color='b')

# Add labels and title
plt.xlabel('Edu Expenditure (percent GDP)')
plt.ylabel('GDP per Capita (in 2073)')
plt.title('GDP per Capita (in 2073) vs. Edu Expenditure (percent GDP)')

# Add a horizontal line showing the current GDP per Capita
plt.axhline(y=sgp_data.gdp_per_capita.iloc[-1], color='g', linestyle='--')

# Place a text annotation on the plot indicating the current GDP per Capita
plt.text(10, sgp_data.gdp_per_capita.iloc[-1] + 700,  # Adjust position to avoid overlap
         f'Current GDP per Capita: {sgp_data.gdp_per_capita.iloc[-1]:,.0f}', 
         color='g', fontsize=12)

# Optionally, add grid lines for better readability
plt.grid(True)

# Show the plot
plt.tight_layout()
plt.show()
No description has been provided for this image

Clearly, our model also predicts that GDP per Capita (50 years down the line) will rise when Edu Expenditure (percent GDP) rises. This confirms our initial findings.

Finally, lets plot a similar line chart of the predicted Gini Index in 50 years as we vary the Educational Expenditure (% GDP)

In [80]:
gini_indexes = []
preset_list = [0, 2, 4, 6, 8, 10]


for preset in preset_list:
    gini_indexes.append(gini_lm.predict(predict_sgp(preset)[['edu_expense_per_capita']].iloc[[-1]]))


# Create the line chart
plt.figure(figsize=(10, 6))
plt.plot(preset_list, gini_indexes, marker='o', linestyle='-', color='b')

# Add labels and title
plt.xlabel('Edu Expenditure (percent GDP)')
plt.ylabel('Gini Index (in 2073)')
plt.title('Gini Index (in 2073) vs. Edu Expenditure (percent GDP)')

# Optionally, add grid lines for better readability
plt.grid(True)

# Show the plot
plt.tight_layout()
plt.show()
No description has been provided for this image

Clearly, the above graph also confirms our findings in Question 2, that when Eductional Expenditure (% GDP) increases, the predicted Gini Index decreases.

Findings¶
  1. As Educational Expenditure increases, predicted GDP per Capita in 2073 also increases.
    • For every increase of 1% in Educational Expenditure (% GDP), the forecasted GDP per Capita in 2073 increases by $1382
  2. As Educational Expenditure increases, predicted Gini Index in 2073 decreases.
    • For every increase of 1% in Educational Expenditure (% GDP), the forecasted Gini Index in 2073 decreases by 2
  3. At Singapore's current Educational Expenditure (% GDP) (2.15%), the forecasted GDP per Capita in 2073 is $88507 USD
    • If Singapore were to spend the median Educational Expenditure (% GDP) (4%), the forecasted GDP per Capita in 2073 is $90299 USD
  4. At Singapore's current Educational Expenditure (% GDP) (2.15%), the forecasted Gini Index in 2073 is 32.5
    • If Singapore were to spend the median Educational Expenditure (% GDP) (4%), the forecasted GDP per Capita in 2073 is 29.0

Results Findings & Conclusion¶

For each research question, summarize in 2-3 visualizations which will answer the question. Intrepret the results accordingly and give your observation and conclusion. The visualizations should be well presented (apply what you have learnt in Chapter 9 on data communication). The plots shown here could be an enhanced version of the EDA plots, or presented in another format.

Question 1¶

What is the correlation between government expenditure on education and GDP growth?

In [81]:
# Create a new figure object for the subplots
fig = plt.figure()

# Add a subplot (ax0) on the left side (1 row, 2 columns, 1st subplot)
ax0 = fig.add_subplot(1, 2, 1)

# Add a subplot (ax1) on the right side (1 row, 2 columns, 2nd subplot)
ax1 = fig.add_subplot(1, 2, 2)

# Create a box plot of the average global education expenditure (% of GDP) over the last 20 years
# Use the color blue and make the plot horizontal, then place it in the first subplot (ax0)
global_edu_expense_percent_gdp.iloc[-20:, :].mean().plot(kind='box', color='blue', vert=False, figsize=(20, 8), ax=ax0)

# Set the title for the first subplot (ax0)
ax0.set_title('Box Plot of Global Education Expenditure (% of GDP)')

# Set the x-axis label for the first subplot (ax0)
ax0.set_xlabel('Education Expenditure (% of GDP)')

# Create a histogram of the average global education expenditure (% of GDP) over the last 20 years
# Place the histogram in the second subplot (ax1)
global_edu_expense_percent_gdp.iloc[-20:, :].mean().plot(kind='hist', figsize=(20, 8), ax=ax1)

# Set the title for the second subplot (ax1)
ax1.set_title('Distribution of Global Education Expenditure (% of GDP)')

# Set the y-axis label for the second subplot (ax1)
ax1.set_ylabel('Number of Countries')

# Set the x-axis label for the second subplot (ax1)
ax1.set_xlabel('Education Expenditure (% of GDP)')

# Set a main title for the entire figure
fig.suptitle('Global Education Expenditure Analysis', fontsize=16)


# Display the figure with both subplots
plt.show()
No description has been provided for this image

From the graphs above, it is quite clear that the vast majority of countries spend between 3-5% of their GDP on Education. However, note that this also means that most countries GDP are 20-30 times their Education Expenditure. Above all, this shows how a countries GDP is really very closely tied to its Education Expenditure.

In [82]:
# Fill forward missing values in the global_edu_expense_percent_gdp dataframe to handle missing data
map_data = global_edu_expense_percent_gdp.ffill().reset_index()

# Reshape the data using melt to make it long-form for plotting
# 'index' becomes 'year', and country ISO codes will be the 'iso_code'
map_data = map_data.melt(id_vars='index').rename(columns={"variable": "iso_code", "index": "year"})

# Map the ISO country codes to their corresponding country names using a dictionary
map_data['country_name'] = map_data['iso_code'].map(country_codes_dict)
map_data = map_data[map_data.year.dt.year==2023]

# Convert the 'year' column to only include the year (dropping month/day if present)
map_data.year = map_data.year.dt.year

# Create an animated choropleth map using Plotly Express
px.choropleth(
    map_data,                        # Input dataset
    locations="iso_code",            # The ISO codes to be used for location
    color="value",                   # Column to represent in color scale (education expenditure as % of GDP)
    hover_name="country_name",       # The name to display when hovering over a country
    color_continuous_scale=px.colors.diverging.RdBu,  # Color scale for the choropleth
    height=600,                      # Height of the figure
    range_color=[3, 7],               # Set the range for the color scale (3% to 7% for education expenditure)
    title="Edu Expenditure (% GDP) in 2023"
)

From the graph , we can see that most developed countries, such as USA, Australia, Sweden, Finland, France all have high education expenditure (blue) whereas most underdeveloped countries, such as countries in Africa and Asia have low education expenditure (brown). Although there are definitely quite a few exceptions, this hints at a trend between how developed a country is and its Education Expenditure (% GDP)

In [83]:
mean_edu_expense = pd.DataFrame(global_edu_expense_percent_gdp.iloc[-10:, :].mean(axis=0), columns=['Mean Education Expenditure (% of GDP)'])
mean_gdp_per_capita = pd.DataFrame(global_gdp_per_capita.iloc[-10:, :].mean(axis=0), columns=['Mean GDP Per Capita'])
mean_pop = pd.DataFrame(global_pop.iloc[-10:, :].mean(axis=0), columns=['Mean Population'])
merged_df = pd.merge(mean_edu_expense, mean_gdp_per_capita, left_index=True, right_index=True).merge(mean_pop, right_index=True, left_index=True)

def pearson_coef(min_pop, min_gdp_per_capita):
    return merged_df[(merged_df['Mean Population'] >= min_pop) & (merged_df['Mean GDP Per Capita'] >= min_gdp_per_capita)].corr().iloc[0, 1]

pop_ranges = range(10000000, 1000000, -1000000)
gdp_ranges = range(5000, 10001, 1000)

correlation_matrix = pd.DataFrame(index=pop_ranges, columns=gdp_ranges)

for min_pop in pop_ranges:
    for min_gdp in gdp_ranges:
        correlation_matrix.loc[min_pop, min_gdp] = pearson_coef(min_pop, min_gdp)

correlation_matrix = correlation_matrix.apply(pd.to_numeric, errors='coerce')

plt.figure(figsize=(12, 8))
 
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True,
            xticklabels=[int(gdp / 1000) for gdp in gdp_ranges], yticklabels=[int(pop / 1e6) for pop in pop_ranges])
plt.title('Correlation between Education Expenditure (% GDP) and GDP per Capita')
plt.xlabel('Min GDP per Capita Threshold (Thousand USD)')
plt.ylabel('Min Population Threshold (Millions)')
plt.text(-1, 5, 'Bigger Countries', rotation='vertical', color='g', fontsize=20)
plt.annotate('', xy=(-0.1, 0.9), xycoords='axes fraction', xytext=(-0.1, 0.1),
             arrowprops=dict(facecolor='g', arrowstyle='->', edgecolor='g', lw=2), fontsize=12, ha='center', va='center')
plt.text(2, 11, 'More Developed Countries', rotation='horizontal', color='g', fontsize=20)
plt.annotate('', xy=(0.9, -0.15), xycoords='axes fraction', xytext=(0.1, -0.15),
             arrowprops=dict(facecolor='g', arrowstyle='->', edgecolor='g', lw=2), fontsize=12, ha='center', va='center')

plt.show()
No description has been provided for this image

From the graph, we can see that as we increase the Min Population Threshold and the Min GDP per capita Threshold, the correlation between Education Expenditure as a percent of GDP and GDP per Capita gets stronger and stronger. In other words, this confirms that as countries get Bigger and More Developed, their GDP Growth is ever more reliant on the Education Expenditure

Question 2¶

To what extent does government expenditure on education reduce income inequality, as measured by the Gini coefficient?

In [84]:
mean_gini = gini.iloc[-10:, :].mean()
mean_top_10 = top_10.iloc[-10:, :].mean()
mean_top_20 = top_20.iloc[-10:, :].mean()
mean_bottom_20 = bottom_20.iloc[-10:, :].mean()
mean_bottom_10 = bottom_10.iloc[-10:, :].mean()
merged_mean = pd.DataFrame(data={"Gini Index": mean_gini, "Top 10%": mean_top_10, "Top 20%": mean_top_20, "Bottom 10%": mean_bottom_10, "Bottom 20%": mean_bottom_20})
fig = sns.violinplot(merged_mean)
plt.ylabel('Number of Countries')
plt.title("Boxplots of Income Inequality Indicators")
plt.show()
No description has been provided for this image

The above graph shows how there is indeed a very pressing problem of Income Inequality. The difference between the shares of the Top 10% and the Bottom 10% are quite vast. Should Education Expenditure prove to be a viable method of decreasing Income Inequality, the graph above makes it clear that investing in Education becomes all the more important.

In [85]:
plt.figure(figsize=(9, 7))

inv_color_dict = edu_expense_per_capita.iloc[-10:, :].mean().dropna().sort_values(ascending=False).reset_index().drop(0, axis=1).to_dict()['index']
keys = np.array(list(inv_color_dict.keys()))
values = np.array(list(inv_color_dict.values()))
color_dict = dict(zip(values, keys))

# Get a colormap
cmap = plt.get_cmap('coolwarm')
norm = plt.Normalize(0, len(color_dict))

# Plot each country's Gini index as a scatter plot, only if the country exists in color_dict
for country in gini.columns:
    if country in color_dict:  # Check if the country is in the color_dict
        color_value = norm(color_dict[country])  # Normalize the country index
        sns.scatterplot(x=gini.index.year, y=gini[country], color=cmap(color_value), label=country)

# Add a legend
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', ncol=6)
plt.ylabel("Gini Index")
plt.xlabel("Time")
plt.text(1965, 45, "Blue to Red", rotation='vertical', color='g', fontsize=20)
plt.arrow(1970, 40, 0, 25, color='g', width=0.7)
plt.show()
No description has been provided for this image

The dots in the graph above have been coloured such that countries with Low Educational Expenditure have been coloured Red, while the countries with High Educational Expenditure have been colored Blue. From the graph, we can see a clear trend of Red/Grey to Blue as we move downward, indicating how countries with Lower Gini Index tend to have Higher Educational Expenditure

In [86]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

circle1 = patches.Ellipse((400, 55), 1000, 23, color='g', fill=False)

# First subplot - Gini Index vs Educational Expenditure
sns.regplot(data=q2_df, x='edu_expense_per_capita', y='gini_index', marker='.', line_kws={"color": "#FF6666"}, ax=axes[0])
axes[0].set_ylabel("Gini Index")
axes[0].set_xlabel("Edu Expenditure per Capita")
axes[0].set_title("Gini Index vs Educational Expenditure")
axes[0].add_patch(circle1)
axes[0].text(1100, 55, "Outliers   ------------------------------>", color='g', fontsize=20)

# Second subplot - Distribution of GDP per Capita among Outliers
outliers = q2_df[(q2_df.gini_index > q2_df.gini_index.quantile(0.75)) & (q2_df.edu_expense_per_capita < 800)].groupby(['iso_code'])[['gini_index', 'edu_expense_per_capita']].mean()
outliers['GDP_per_capita'] = outliers.index.map(global_gdp_per_capita.mean(axis=0).to_dict())
outliers['GDP_per_capita'].plot(kind='hist', ax=axes[1])
axes[1].set_title('Distribution of GDP per Capita among Outliers')
axes[1].set_xlabel('GDP per Capita (USD)')
axes[1].set_ylabel('Number of Countries')
axes[1].text(3000, 15, "Very Low GDP Per Capita \n => Underdeveloped Countries", fontsize=15, color='g')

# Show the combined plot
plt.tight_layout()
plt.show()
No description has been provided for this image

The scatterplot above nicely displays how an increasing Education Expenditure certainly brings about a decreasing Gini Index. When we investigate the outliers, we discover that all of them are underdeveloped countries with very low GDP per Capita and a very low Education Expenditure. This further supports our findings that increasing Education Expenditure brings about a decreasing Gini Index.

Question 3¶

What predictive trends can be identified regarding the future impact of current education spending in Singapore on the economical well-being of its population?

We will forecast two key variables: GDP Growth (Annual %) and Gini Index.

The (Multiple Linear Regression Model) MLM for GDP Growth (Annual %) will incorporate Population, Educational Expenditure as a percent of GDP, Educational Expenditure per Capita, GDP per Capita as predictors. By providing inputs for Educational Expenditure as a percent of GDP, we will explore how it influences Singapore’s future GDP. Furthermore, based on our findings from Question 1, we will exclude countries with very low Population or GDP per Capita from the training data to enhance the robustness of the model.

The (Linear Regression Model) LM for Gini Index will focus on predicting income inequality, using Educational Expenditure per Capita as a key predictor, in line with the insights derived from our analysis in Question 2.

Here is a detailed explanation of how our model will operate, along with the reasoning behind each step. First, we will set a value for Educational Expenditure as a percentage of GDP, which reflects the proportion of the national budget allocated to education.

Next, we will model the projected population growth for the next 50 years.

Our Multiple Linear Regression Model (MLM) will then come into play to predict GDP growth for each year, based on several critical variables:

  1. Previous year's Population
  2. Educational Expenditure as a percentage of GDP
  3. Educational Expenditure per Capita
  4. GDP per Capita

Once the model calculates the projected GDP Growth for a given year, we will use this growth rate to estimate the next year's GDP per Capita and Educational Expenditure per Capita.

This process will be repeated iteratively for each year over a 50-year period, allowing us to simulate how changes in education spending and population growth could influence Singapore's economic well-being over the long term.

In [87]:
import matplotlib.pyplot as plt

# Define the preset values
preset_list = [0, 2, 4, 6, 8, 10, 12, 15, 20]
gdp_changes = []

# Collect the last value of gdp_per_capita for each preset
for preset in preset_list:
    gdp_changes.append(predict_sgp(preset).gdp_per_capita.iloc[-1])

# Create the line chart
plt.figure(figsize=(10, 6))
plt.plot(preset_list, gdp_changes, marker='o', linestyle='-', color='b')

# Add labels and title
plt.xlabel('Edu Expenditure (percent GDP)')
plt.ylabel('GDP per Capita (in 2073)')
plt.title('GDP per Capita (in 2073) vs. Edu Expenditure (percent GDP)')

# Add a horizontal line showing the current GDP per Capita
plt.axhline(y=sgp_data.gdp_per_capita.iloc[-1], color='g', linestyle='--')

# Place a text annotation on the plot indicating the current GDP per Capita
plt.text(10, sgp_data.gdp_per_capita.iloc[-1] + 700,  # Adjust position to avoid overlap
         f'Current GDP per Capita: {sgp_data.gdp_per_capita.iloc[-1]:,.0f}', 
         color='g', fontsize=12)

# Optionally, add grid lines for better readability
plt.grid(True)

# Show the plot
plt.tight_layout()
plt.show()
No description has been provided for this image

Clearly, our model predicts that GDP per Capita (50 years down the line) will rise when Edu Expenditure (percent GDP) rises. More concretely, for every increase of 1% in Educational Expenditure (% GDP), the forecasted GDP per Capita in 2073 increases by $1382. This shows how (at least according to this model), investing in Education has a big impact on Singapore's future GDP

In [88]:
gini_indexes = []
preset_list = [0, 2, 4, 6, 8, 10]


for preset in preset_list:
    gini_indexes.append(gini_lm.predict(predict_sgp(preset)[['edu_expense_per_capita']].iloc[[-1]]))


# Create the line chart
plt.figure(figsize=(10, 6))
plt.plot(preset_list, gini_indexes, marker='o', linestyle='-', color='b')

# Add labels and title
plt.xlabel('Edu Expenditure (percent GDP)')
plt.ylabel('Gini Index (in 2073)')
plt.title('Gini Index (in 2073) vs. Edu Expenditure (percent GDP)')

# Optionally, add grid lines for better readability
plt.grid(True)

# Show the plot
plt.tight_layout()
plt.show()
No description has been provided for this image

Our model also predicts that Gini Index (50 years down the line) will decrease when Edu Expenditure (percent GDP) rises. More concretely, For every increase of 1% in Educational Expenditure (% GDP), the forecasted Gini Index in 2073 decreases by 2. This shows how (at least according to this model), investing in Education has a big impact on Singapore's future Gini Index

Conclusion¶

In this analysis, we explored the impact of government expenditure on education on key economic indicators such as GDP growth and income inequality. The findings indicate that government investment in education plays a pivotal role in driving economic growth and reducing inequality.

First, the strong positive correlation between a country’s expenditure on education and its GDP underscores the importance of prioritizing education as a means of enhancing national economic performance. As countries grow and develop, this relationship becomes even more pronounced, suggesting that education investment should scale alongside economic progress.

Second, the analysis of income inequality, as measured by the Gini Index, reveals that higher educational expenditure per capita is associated with a reduction in inequality. Countries with the highest levels of inequality tend to invest significantly less in education, further emphasizing the role of education in promoting a more equitable society.

Finally, the predictive trends for Singapore demonstrate that increasing educational expenditure not only boosts GDP per capita but also leads to a more equitable distribution of wealth in the future. Projections for 2073 show that even modest increases in education spending could result in significant economic gains and a reduction in income inequality. These findings highlight the potential long-term benefits of education investments and the critical role they play in shaping a nation’s economic future.

While the analysis and predictions show a high degree of accuracy, it is important to recognize that long-term success will depend on consistent and targeted educational spending. Future research could explore the integration of other social factors and investigate how the quality of education impacts these economic indicators. Expanding the scope to include more countries and exploring the psychological and social benefits of education could further enhance the understanding of its comprehensive impact on well-being.

In conclusion, investing in education is not just about fostering knowledge—it is a powerful economic tool that drives growth, reduces inequality, and builds a stronger, more equitable future. Governments should consider increasing their expenditure on education to maximize these benefits.

Recommendations or Further Works¶

State any recommendations, improvements or further works.

Areas for Improvement¶

  1. Data Quality and Coverage: To make our findings more accurate, we should aim to gather more comprehensive data, especially for countries where educational and economic data are either incomplete or not widely available. By expanding the dataset to cover more countries and extending the time periods, we can strengthen the conclusions drawn from this analysis.

  2. Control for Other Variables: While education expenditure is clearly important, other factors—like political stability, health spending, and labor market policies—also play a role in affecting GDP and inequality. It would make sense to conduct a multivariate analysis that accounts for these variables to provide a clearer understanding of how education spending alone impacts these economic indicators.

  3. Country-Specific Analysis: Although global and regional trends offer valuable insights, drilling down to a more granular, country-specific level might reveal even more actionable insights for individual governments. Exploring how education quality, curriculum relevance, and infrastructure differ from country to country could help refine policy recommendations that are more targeted.

  4. Disaggregate Gini Index Data: Currently, the Gini Index is treated as an aggregate in the analysis. To get a better grasp of how education impacts inequality, further research could break down the Gini Index by different factors like age, gender, and regions within countries. This could lead to more targeted interventions, especially in places where inequality is deeply divided along these lines.

Further Research¶

  1. Investigate Long-Term Impact of Education Expenditure: It would be beneficial to explore how education spending impacts economies over a longer time frame. Longitudinal studies that track this relationship across decades could give us deeper insights into the delayed effects of education investment on both GDP and inequality. Understanding these time lags is crucial for refining long-term policy decisions.

  2. Examine the Role of Educational Quality: Beyond just spending, it’s important to consider the quality of education itself. Metrics like student performance, literacy rates, and international testing scores could shed light on how improved educational outcomes at any given level of expenditure affect GDP and inequality.

  3. Expand the Scope to Include Technological and Vocational Training: Another area worth exploring is the impact of vocational and technical education spending. As economies evolve and the demand for skilled labor in technical fields increases, it’s important to understand how this type of education affects overall economic outcomes.

  4. Cross-Compare with Other Forms of Public Expenditure: It could be insightful to compare education spending with other areas of public expenditure, such as healthcare or infrastructure, to determine which type of investment yields the best returns in terms of economic growth and reducing inequality.

  5. Forecasting Techniques: To enhance predictions of future trends, exploring machine learning models like time-series analysis or econometric modeling could provide more robust forecasts for how educational expenditure might affect GDP and the Gini Index moving forward.

References¶

A reference section citing any additional references / links you may have used for the project.
  1. https://www.influxdata.com/time-series-forecasting-methods/#:~:text=Time%20series%20forecasting%20is%20a,Astronomy
  2. https://www.tableau.com/learn/articles/time-series-forecasting
  3. https://otexts.com/fpp2/data-methods.html
  4. https://en.wikipedia.org/wiki/Gini_coefficient
  5. https://data.worldbank.org/indicator/SI.POV.GINI
  6. https://databank.worldbank.org/metadataglossary/gender-statistics/series/SI.POV.GINI
If you have used ChatGPT or any other AI tools (e.g. Bing Chat, Bard etc), state down the prompts used clearly and share the link of the relevant chat transcript (if any). Failure to declare use of AI tools or to cite it properly is an act of plagarism (unattributed use) and will face academic and disciplinary consequences.

Prompts Used¶

  1. Give guidelines on how to build a predictive model in python for predicting GDP primarily based on governmental expenditure on education. I have the following datasets.
  2. Could you build an AI model instead? Would you recommend doing so?
  3. Implement the non-AI predictive models, using the following data
  4. How do we check for multicollinearity?
  5. How would you use time-series forecasting methods instead?
  6. Give me a list of time-series forecasting methods, and their pros and cons
  7. Give me a zero to hero course on prophet
  8. Can prophet handle predicting based on.multiple variables?
  9. what was 2024 1q gdp sg in sgd?
  10. you did not import pd
  11. Give me a diverging chloropleth colour scheme
  12. Think of at least 25 possible visualisations based on the following data. Note that all the data are across countries
  13. Give me code for a treemap
  14. Please web-scrape https://www.iban.com/country-codes
  15. Is it reasonable to say that countries with lower that 5000 usd GDP per capita are underdeveloped or developing?
  16. Give me visualisations for the below data, ranging across countries and years
  17. I have a df with countries as the columns and years as the index. I also have another df of GDP of a country. I want to plot a lineplot of the entries of the first df, across time. However, I also want to color the countries in descending GDP, from red to blue. How?
  18. convert these 2 to subplots in the same figure
  19. make the below plots 2 static subplots, one for gini and one for edu_expense
  20. I want the color bars to be shown sepatately
  21. is gini coeff higher in developed or developing countries
  22. how to give chloropleth graph a title?
  23. I want you to create 2 subplots, the right one displaying the figure below and the left one having q3_df.gini_index.describe()
  24. could you use .text() for the right one?
  25. how to write 3rd markdown
  26. no i meant with the rd on top
  27. Answer the following question with the attached data. What predictive trends can be identified regarding the future impact of current education spending in Singapore on the economical well-being of its population?
  28. perhaps you are using the old file path structure?
  29. Answer the following question assuming you have the below data. What predictive trends can be identified regarding the future impact of current education spending in Singapore on the economical well-being of its population?
  30. Using this dataset, create an mlm model to predict the gdp of singapore in the times to come
  31. Rewrite the below
  32. Rewrite the below.
  33. Rewrite the below, elaborating and justifying
  34. Add some text saying "Mean Overall Population Growth" and "Mean Population Growth in the last decade"
  35. what are all the libraries required for these. annwer in the form ! pip3 install ..
  36. can you give them in separate lines? and use ! pip install..
  37. Give an overview of what these indicators mean We will be using the following datasets for this question
  38. make far more brief
  39. For the following prompts, I will give you some code. DO NOT CHANGE THE CODE ITSELF, but add appropriate comments everywhere.

Sharing link of entire chat transcript¶

https://chatgpt.com/share/66e6f45d-5834-8008-a441-016bf8c2d538
https://chatgpt.com/share/66fbc50d-e2f0-8008-b26a-3085c95d3e1e
https://chatgpt.com/share/66fbe301-36d0-8000-859f-710ec27e185a
https://chatgpt.com/share/66fbfd1a-d330-8000-9f05-d28a7c9bdf41